Doomscrolling through my news feed, I see a pile-up of global and national crises, decades and centuries in the making. Perhaps the most pressing crises we face today are ecological. Everything we have stems from life on this planet. Even if we go to Mars, there are no economies, no politics, no humanity without a biosphere from which to derive them. Yet, it seems that humanity is all that stands between us and a mostly humane, non-catastrophic future, and that window of agency may be rapidly closing.
That's why, given this course's choice of datasets, I chose to investigate GapMinder's. GapMinder pulls data from various sources about the global state of affairs. It is a self-ascribed "fact tank, not a think tank" with facts about things like economics, education, population, environment, etc. Specifically, I chose to select indicators to look into the relationships between democracy, education, and equity.
The main purpose of this project is to extend and showcase my exploratory data wrangling and Python skills, not to produce any rigorous research or analysis.
While I won't be demonstrating any causal correlation between these broad categories, my hunch going into the investigation was that there exists an interdependent relationship among all three axises (democracy, education, and equity): an equitable democracy depends on educated citizens; inequity inhibits proper education of the citizenry as a whole and tips the scale of the democratic process; democratic engagement in the public interest ensures equity writ large and equity in educational opportunity. In other words, there is no x factor. And, in such a systemic relationship, you may find that you can affect all parts by affect one part, and you may also find that you can't significantly affect one part without affecting them all.
Testing these huge assertions is well beyond the scope of this project, but a reasonable starting point is to ask whether we can find any such correlation between specific measures of democracy, education, and equity. That is what I ask here.
What I found was that there were far more non-correlated indicator pairs across axises than there were correlated, despite overlap in the constructs and indicators (e.g. some measures of democracy include measures of equity, by definition). In an effort to limit this leakage between constructs, I limited indicators on the equity axis to indicators socioeconomic equity. The outcome was that democracy was the most self-correlated axis of indicators, while socioeconomic equity was the least self-correlated axis, indicating the relative stability and coherence of the constructs and selection of indicators in these axises. The democracy and education axises were most correlated. No rigorous methodology was applied to the selection, grouping, and analysis of these indicators.
For a second research inquiry, I wanted to know what indicators do correlated strongly across axises. I selected three indicators that correlated strongly with multiple indicators across axises. They were measures of gender equity, government functioning, and poverty. I found that there existed correlations between them.
Measures of democracy, education, and equity are many. They are also often overlapping, causing leakage. Some measures of democracy include measures of gender equity, for instance, and some measures of equity include measures of educational access.
I mitigated leakage primarily by narrowing equity to socioeconomic equity, using indicators such as distribution of wealth and income. I didn't include health and healthcare indicators here, though they are arguably within the domain of socioeconomic equity, in order to avoid overlap with democratic indicators and other possible constructs beyond the three axises I have chosen. I did however include some measures of unemployment and government insurance because they are more directly tied to income and socioeconomic security, but mostly out of curiosity.
Gender parity is a huge theme in Gapminder datasets. I opted to exclude indicators that measured in terms of one sex or the other, and included only measures of the population as a whole or age groups. I did this to limit my focus on socioeconomic equity, and to avoid overlap with democracy. I included at least one broad indicator of gender equity in general as an indicator of democracy.
A more rigorous selection and labeling process is certainly warranted by the research question, but not by the project scope and assignment rubric.
Holding the reigns of my education, I chose a big enough question to hold my interest, and to extend and strengthen my Python skills. I have knowingly set some statistical analysis tasks that might be better accomplished with other methods, and indeed might be considered reinventing the wheel. But, in doing so, I have deepened my statistical understanding and improved my Python and coding skills in such a way that I am more ready to move onto more sophisticated methods and tools.
import os
import pandas as pd
import numpy as np
from scipy import stats
from enum import Enum
%matplotlib notebook
from matplotlib import pyplot as plt
import seaborn as sns
import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)
import plotly.express as px
from pandas_profiling import ProfileReport
import pprint as pp
# To produce all output in a cell, not just last output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# To revert:
# InteractiveShell.ast_node_interactivity = "last_expr"
I want to work with a lot of different tables from the Gapminder set, so rather than reading them into dataframes one by one, I downloaded them to a folder and wrote a function to read them into a dictionary of dataframes.
There may have been an easier way to download the tables* in bulk, but I wanted to browse through them on the site and choose which to download anyway, so it was just as easy to download them one at a time. However, this choice leaves all metadata on the Gapminder site, so looking up descriptions of each indicator has to be done online. A quick internet search for the file name (without file type suffix) typically turns up the original data source. For this project, this more than suffices.
I also needed to label these indicator tables as belonging to one of my target axises: democracy, education, and socioeconomic equity. One approach might be to create an indicator class that has a dataframe and a label, but I want to put the indicators in a table for easier grouping by their label. So, I created a pandas series of labels, indexed by indicator names (file name without the file type suffix). I created a dictionary of dataframes for the indicator tables, with keys identical to the label series index. When I want to select dataframes by their labels, I can select the keys from the label series index.
I orginally wanted to put the indicator dataframes into a series as well and join it with the labels series into a dataframe. But, I ran into problems with operating on the dataframes within the dataframe (e.g. adjusting for population). Plus, dict autocompletes key values, whereas pandas doesn't autocomplete index or column names. I might lose a marginal amount of processing speed with a dict, and perhaps some coding maneuverability, but it's a compromise I can live with.
* Note: I use "tables" generically to refer to 2D data structures with rows and columns. The table I'm talking about might be a CSV file, a pandas dataframe, a numpy array, etc., or even a pandas series.
def read_files_in_folder(*args: str, fp_list: str, path: str = None,
columns_name: str = None) -> pd.DataFrame:
'''Reads a folder of csv tables into a dictionary of dataframes.
If optional pd.read_csv arguments are passed (*args), dynamically writes
a script to a file and runs the script.
Parameters:
*args are str: (optional) pd.read_csv parameters.
fp_list is [str]: list of filenames or filepaths of csv files.
path is str: (optional) filepath str filenames. os.curdir if None.
columns_name: (optional) str to name the columns names row.
Returns:
Dictionary of dataframes created from csv files.'''
df_dict = {}
if path is None:
path = os.curdir
# If pd.read_csv statement excludes optional arguments,
# may execute without writing to file.
if args is None:
for fp in fp_list:
fp_var_name = fp.split('/')[-1].split('.')[0]
df_dict[fp_var_name] = pd.read_csv(path + fp)
# If pd.read_csv statement includes optional arguments,
# must write to file to execute.
else:
argslist = ', '.join(args)
with open('script.py', 'w') as file:
file.write("""
import pandas as pd
def csvs_to_df_dict():
\tscript_df_dict = {}
""")
for fp in fp_list:
fp_var_name = fp.split('/')[-1].split('.')[0]
statement = "\tscript_df_dict['" + fp_var_name + \
"'] = pd.read_csv('" + path + fp + "', " + \
argslist + ")\n"
file.write(statement)
file.write("\tscript_df_dict['" + fp_var_name + \
"'].columns.name = 'year'\n")
file.write('\treturn script_df_dict')
# Run the script to compile the function that performs the reads.
import script
# Run the function to create the dictionary of dataframes.
df_dict = script.csvs_to_df_dict()
return df_dict
# Location and names of csv files to load.
csv_path = os.curdir + '/csv_tables/'
filename_list = os.listdir(path=csv_path)
# Dictionary of indicators/tables as dataframes from csv files.
df_dict = read_files_in_folder('index_col=0',
fp_list=filename_list, path=csv_path)
# Deleting unnecessary variables/functions to free memory for larger objects.
del read_files_in_folder
del csv_path
del filename_list
Scrolling through the dictionary of tables, it appears that all tables have years for columns and countries for an index. They all appear to be two-dimensional and contain numerical values. They don't all share the same shape, some having more years and/or countries than others.
We do find that tables from the same organization tend to have the same shape. For instance, the Economic Intelligence Unit (EIU) democracy indicators* share the same countries and years, though some records may contain missing values.
* EIU calls them indices, but I don't want to confuse terms when speaking of table indices, so I call them indicators -- though the words share a root.
# df_dict
I created a better describe table, actually two. I add pandas.DataFrame.skew and pandas.DataFrame.kurtosis to the pandas.DataFrame.describe dataframe. To create a second table of summary stats of the summary stats (to compare columns to the table as a whole), I transpose the augmented describe table and describe it. It's not so necessary here, but I use it later in the analysis. I'll test drive it here to check out a couple of democratic indicators from EIU.
def get_better_desc(df, recursions: int = 1, percentiles=None, include=None,
exclude=None, datetime_is_numeric: bool = False):
'''Adds kurtosis and skew to pandas.DataFrame.describe output. And, creates
second transposed version of this table called on itself for summary stats
of summary stats.
Parameters:
df: pandas.DataFrame, or Series but its super_desc isn't so meaningful.
recursions: integer number of times to apply recursively to create
super_desc. Default value of 1 is all that is necessary.
(percentiles=None, include=None, exclude=None,
datetime_is_numeric: bool = False): optional parameters to pass to
pandas.DataFrame.describe.
Returns:
better_desc: pandas.Dataframe (or Series) with kurtosis and skew added.
super_desc: pandas.DataFrame (or Series) of better_desc transposed and
made into a better_desc itself.'''
kurt = df.kurtosis()
kurt.name = 'kurt'
skew = df.skew()
skew.name = 'skew'
better_desc = df.describe(percentiles=percentiles, include=include,
exclude=exclude,
datetime_is_numeric=datetime_is_numeric\
).append([kurt, skew])
### Try with concat instead of append for speed?
if recursions > 0:
super_desc = get_better_desc(better_desc.transpose(),
recursions=(recursions - 1))[0]
else:
super_desc = better_desc
return better_desc, super_desc
df_dict['polpartix_eiu']
get_better_desc(df_dict['polpartix_eiu'])[0] # better_desc
get_better_desc(df_dict['polpartix_eiu'])[1] # super_desc
df_dict['gvtx_eiu']
get_better_desc(df_dict['gvtx_eiu'])[0] # better_desc
get_better_desc(df_dict['gvtx_eiu'])[1] # super_desc
| year | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| country | |||||||||||||
| Afghanistan | 0.222 | 0.222 | 0.222 | 0.250 | 0.278 | 0.278 | 0.278 | 0.278 | 0.389 | 0.389 | 0.278 | 0.278 | 0.444 |
| Albania | 0.444 | 0.444 | 0.444 | 0.444 | 0.444 | 0.500 | 0.500 | 0.500 | 0.500 | 0.556 | 0.556 | 0.556 | 0.556 |
| Algeria | 0.222 | 0.194 | 0.167 | 0.223 | 0.278 | 0.278 | 0.389 | 0.389 | 0.389 | 0.389 | 0.389 | 0.389 | 0.389 |
| Angola | 0.111 | 0.250 | 0.389 | 0.416 | 0.444 | 0.444 | 0.500 | 0.500 | 0.500 | 0.500 | 0.556 | 0.556 | 0.556 |
| Argentina | 0.556 | 0.556 | 0.556 | 0.556 | 0.556 | 0.556 | 0.556 | 0.556 | 0.556 | 0.611 | 0.611 | 0.611 | 0.611 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Venezuela | 0.556 | 0.556 | 0.556 | 0.556 | 0.556 | 0.556 | 0.556 | 0.556 | 0.556 | 0.500 | 0.556 | 0.611 | 0.444 |
| Vietnam | 0.278 | 0.223 | 0.167 | 0.250 | 0.333 | 0.278 | 0.278 | 0.389 | 0.389 | 0.389 | 0.389 | 0.389 | 0.389 |
| Yemen | 0.278 | 0.306 | 0.333 | 0.361 | 0.389 | 0.444 | 0.500 | 0.500 | 0.500 | 0.444 | 0.444 | 0.444 | 0.389 |
| Zambia | 0.333 | 0.333 | 0.333 | 0.361 | 0.389 | 0.444 | 0.444 | 0.444 | 0.444 | 0.389 | 0.389 | 0.389 | 0.389 |
| Zimbabwe | 0.389 | 0.389 | 0.389 | 0.361 | 0.333 | 0.389 | 0.333 | 0.333 | 0.389 | 0.389 | 0.389 | 0.444 | 0.444 |
164 rows × 13 columns
| year | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 |
| mean | 0.444488 | 0.451113 | 0.457713 | 0.457405 | 0.456994 | 0.471896 | 0.483073 | 0.494274 | 0.502409 | 0.508841 | 0.510165 | 0.513213 | 0.523738 |
| std | 0.214436 | 0.210027 | 0.209261 | 0.201007 | 0.197480 | 0.188580 | 0.185739 | 0.184204 | 0.185310 | 0.187705 | 0.188527 | 0.190476 | 0.189113 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.056000 | 0.111000 | 0.111000 | 0.111000 | 0.111000 | 0.111000 | 0.111000 | 0.111000 |
| 25% | 0.278000 | 0.278000 | 0.278000 | 0.306000 | 0.333000 | 0.333000 | 0.333000 | 0.389000 | 0.389000 | 0.389000 | 0.389000 | 0.389000 | 0.389000 |
| 50% | 0.444000 | 0.444000 | 0.444000 | 0.458500 | 0.444000 | 0.500000 | 0.500000 | 0.500000 | 0.500000 | 0.500000 | 0.500000 | 0.500000 | 0.556000 |
| 75% | 0.611000 | 0.611000 | 0.611000 | 0.584000 | 0.611000 | 0.611000 | 0.611000 | 0.611000 | 0.667000 | 0.667000 | 0.667000 | 0.667000 | 0.667000 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| kurt | -0.401575 | -0.371022 | -0.357456 | -0.286012 | -0.280298 | -0.238225 | -0.281322 | -0.275295 | -0.395989 | -0.599341 | -0.594755 | -0.602841 | -0.640992 |
| skew | 0.178590 | 0.204267 | 0.169831 | 0.202294 | 0.191543 | 0.183646 | 0.215948 | 0.181828 | 0.182400 | 0.097572 | 0.054581 | 0.021340 | -0.020561 |
| count | mean | std | min | 25% | 50% | 75% | max | kurt | skew | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 13.0 | 13.000000 | 13.000000 | 13.000000 | 13.000000 | 13.000000 | 13.000000 | 13.0 | 13.000000 | 13.000000 |
| mean | 164.0 | 0.482717 | 0.194759 | 0.064077 | 0.344077 | 0.483885 | 0.630462 | 1.0 | -0.409625 | 0.143329 |
| std | 0.0 | 0.027500 | 0.010552 | 0.054778 | 0.047456 | 0.034188 | 0.030937 | 0.0 | 0.147614 | 0.078038 |
| min | 164.0 | 0.444488 | 0.184204 | 0.000000 | 0.278000 | 0.444000 | 0.584000 | 1.0 | -0.640992 | -0.020561 |
| 25% | 164.0 | 0.457405 | 0.187705 | 0.000000 | 0.306000 | 0.444000 | 0.611000 | 1.0 | -0.594755 | 0.097572 |
| 50% | 164.0 | 0.483073 | 0.189113 | 0.111000 | 0.333000 | 0.500000 | 0.611000 | 1.0 | -0.371022 | 0.181828 |
| 75% | 164.0 | 0.508841 | 0.201007 | 0.111000 | 0.389000 | 0.500000 | 0.667000 | 1.0 | -0.281322 | 0.191543 |
| max | 164.0 | 0.523738 | 0.214436 | 0.111000 | 0.389000 | 0.556000 | 0.667000 | 1.0 | -0.238225 | 0.215948 |
| kurt | 0.0 | -1.688946 | -0.770078 | -2.116716 | -1.649888 | -0.086545 | -1.733369 | 0.0 | -1.397434 | 0.078974 |
| skew | 0.0 | 0.035507 | 0.892900 | -0.355186 | -0.350943 | 0.330727 | 0.306039 | 0.0 | -0.580095 | -1.210462 |
| year | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| country | |||||||||||||
| Afghanistan | 0.000 | 0.0395 | 0.079 | 0.079 | 0.079 | 0.079 | 0.079 | 0.079 | 0.114 | 0.114 | 0.114 | 0.114 | 0.114 |
| Albania | 0.507 | 0.5070 | 0.507 | 0.507 | 0.507 | 0.471 | 0.400 | 0.400 | 0.400 | 0.436 | 0.436 | 0.471 | 0.471 |
| Algeria | 0.221 | 0.2210 | 0.221 | 0.221 | 0.221 | 0.221 | 0.221 | 0.221 | 0.221 | 0.221 | 0.221 | 0.221 | 0.221 |
| Angola | 0.214 | 0.2680 | 0.321 | 0.321 | 0.321 | 0.321 | 0.321 | 0.321 | 0.321 | 0.321 | 0.321 | 0.286 | 0.286 |
| Argentina | 0.500 | 0.5000 | 0.500 | 0.535 | 0.571 | 0.571 | 0.571 | 0.571 | 0.571 | 0.500 | 0.500 | 0.500 | 0.536 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Venezuela | 0.364 | 0.3960 | 0.429 | 0.411 | 0.393 | 0.393 | 0.429 | 0.429 | 0.429 | 0.393 | 0.250 | 0.286 | 0.179 |
| Vietnam | 0.429 | 0.4290 | 0.429 | 0.429 | 0.429 | 0.429 | 0.393 | 0.393 | 0.393 | 0.393 | 0.321 | 0.321 | 0.321 |
| Yemen | 0.271 | 0.2610 | 0.250 | 0.214 | 0.179 | 0.036 | 0.143 | 0.143 | 0.143 | 0.036 | 0.000 | 0.000 | 0.000 |
| Zambia | 0.464 | 0.4640 | 0.464 | 0.500 | 0.536 | 0.500 | 0.536 | 0.536 | 0.536 | 0.536 | 0.536 | 0.500 | 0.464 |
| Zimbabwe | 0.079 | 0.0790 | 0.079 | 0.104 | 0.129 | 0.129 | 0.129 | 0.129 | 0.129 | 0.200 | 0.200 | 0.200 | 0.200 |
164 rows × 13 columns
| year | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 |
| mean | 0.498305 | 0.499293 | 0.500220 | 0.500875 | 0.501488 | 0.496713 | 0.497561 | 0.491256 | 0.491909 | 0.490793 | 0.488390 | 0.488689 | 0.486518 |
| std | 0.254563 | 0.252564 | 0.252777 | 0.250618 | 0.251914 | 0.250947 | 0.251403 | 0.249236 | 0.245754 | 0.250070 | 0.250682 | 0.252689 | 0.252842 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.307000 | 0.299500 | 0.321000 | 0.304000 | 0.321000 | 0.286000 | 0.286000 | 0.286000 | 0.286000 | 0.314000 | 0.291250 | 0.286000 | 0.286000 |
| 50% | 0.500000 | 0.500000 | 0.500000 | 0.500000 | 0.500000 | 0.500000 | 0.500000 | 0.500000 | 0.503500 | 0.521500 | 0.536000 | 0.521500 | 0.500000 |
| 75% | 0.679000 | 0.697000 | 0.687750 | 0.697000 | 0.714000 | 0.714000 | 0.714000 | 0.687750 | 0.714000 | 0.687750 | 0.687750 | 0.679000 | 0.679000 |
| max | 1.000000 | 1.000000 | 1.000000 | 0.982000 | 0.964000 | 0.964000 | 0.964000 | 0.964000 | 0.964000 | 0.964000 | 0.964000 | 0.964000 | 0.964000 |
| kurt | -0.875046 | -0.842840 | -0.825594 | -0.865636 | -0.908389 | -0.873648 | -0.877569 | -0.843185 | -0.788275 | -0.732188 | -0.785522 | -0.791111 | -0.786893 |
| skew | 0.010703 | 0.010284 | -0.006151 | -0.030309 | -0.074909 | -0.054627 | -0.052383 | -0.066381 | -0.038381 | -0.135973 | -0.107633 | -0.093703 | -0.080519 |
| count | mean | std | min | 25% | 50% | 75% | max | kurt | skew | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 13.0 | 13.000000 | 13.000000 | 13.0 | 13.000000 | 13.000000 | 13.000000 | 13.000000 | 13.000000 | 13.000000 |
| mean | 164.0 | 0.494770 | 0.251235 | 0.0 | 0.297981 | 0.506346 | 0.695231 | 0.973692 | -0.830454 | -0.055383 |
| std | 0.0 | 0.005313 | 0.002171 | 0.0 | 0.013973 | 0.011935 | 0.014229 | 0.015787 | 0.050518 | 0.044589 |
| min | 164.0 | 0.486518 | 0.245754 | 0.0 | 0.286000 | 0.500000 | 0.679000 | 0.964000 | -0.908389 | -0.135973 |
| 25% | 164.0 | 0.490793 | 0.250618 | 0.0 | 0.286000 | 0.500000 | 0.687750 | 0.964000 | -0.873648 | -0.080519 |
| 50% | 164.0 | 0.496713 | 0.251403 | 0.0 | 0.291250 | 0.500000 | 0.687750 | 0.964000 | -0.842840 | -0.054627 |
| 75% | 164.0 | 0.499293 | 0.252689 | 0.0 | 0.307000 | 0.503500 | 0.714000 | 0.982000 | -0.788275 | -0.030309 |
| max | 164.0 | 0.501488 | 0.254563 | 0.0 | 0.321000 | 0.536000 | 0.714000 | 1.000000 | -0.732188 | 0.010703 |
| kurt | 0.0 | -1.662755 | 2.671175 | 0.0 | -1.174436 | 2.144544 | -1.515023 | -0.550727 | -0.596876 | -0.565659 |
| skew | 0.0 | -0.208662 | -1.195944 | 0.0 | 0.680097 | 1.779395 | 0.408782 | 1.176454 | 0.354103 | -0.039048 |
# Indicator labels.
class Ind_Axis(Enum):
DEM = 1
SOCEQ = 2
EDU = 3
# Create a series of indicator labels
labels_sr = pd.Series(data=[Ind_Axis.DEM, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ,
Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.DEM,
Ind_Axis.SOCEQ, Ind_Axis.DEM, Ind_Axis.DEM, Ind_Axis.DEM, Ind_Axis.EDU,
Ind_Axis.DEM, Ind_Axis.DEM, Ind_Axis.DEM, Ind_Axis.DEM, Ind_Axis.DEM,
Ind_Axis.DEM, Ind_Axis.DEM, Ind_Axis.SOCEQ, Ind_Axis.DEM, Ind_Axis.DEM,
Ind_Axis.DEM, Ind_Axis.DEM, Ind_Axis.DEM, Ind_Axis.DEM, Ind_Axis.DEM,
Ind_Axis.DEM, Ind_Axis.DEM, Ind_Axis.DEM, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ,
Ind_Axis.DEM, Ind_Axis.SOCEQ, Ind_Axis.DEM, Ind_Axis.DEM, Ind_Axis.SOCEQ,
Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ,
Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ,
Ind_Axis.DEM, Ind_Axis.DEM, Ind_Axis.DEM, Ind_Axis.DEM, Ind_Axis.EDU,
Ind_Axis.EDU, Ind_Axis.DEM, Ind_Axis.SOCEQ, Ind_Axis.EDU, Ind_Axis.EDU,
Ind_Axis.DEM, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ,
Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ,
Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ,
Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ,
Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ,
Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ,
Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ,
Ind_Axis.SOCEQ, Ind_Axis.DEM, Ind_Axis.DEM, Ind_Axis.DEM, Ind_Axis.DEM,
Ind_Axis.EDU, Ind_Axis.DEM, Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU,
Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU,
Ind_Axis.EDU, Ind_Axis.EDU,Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU,
Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU,
Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU,
Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU,
Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU,
Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU,
Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.EDU,
Ind_Axis.EDU, Ind_Axis.EDU, Ind_Axis.DEM, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ,
Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ,
Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ,
Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ,
Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ,
Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ,
Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ, Ind_Axis.SOCEQ,
Ind_Axis.SOCEQ, Ind_Axis.DEM],
index=df_dict.keys(), name='ind_axis')
labels_sr.index.name = 'indicator'
labels_sr
indicator
abscorrup_idea Ind_Axis.DEM
aged_15plus_unemployment_rate_percent Ind_Axis.SOCEQ
aged_15_24_unemployment_rate_percent Ind_Axis.SOCEQ
aged_25_54_unemployment_rate_percent Ind_Axis.SOCEQ
aged_55_64_unemployment_rate_percent Ind_Axis.SOCEQ
...
sl_uem_intm_zs Ind_Axis.SOCEQ
sl_uem_neet_zs Ind_Axis.SOCEQ
sl_uem_totl_ne_zs Ind_Axis.SOCEQ
sl_uem_totl_zs Ind_Axis.SOCEQ
sorigeq_idea Ind_Axis.DEM
Name: ind_axis, Length: 164, dtype: object
labels_sr.value_counts()
Ind_Axis.SOCEQ 75 Ind_Axis.EDU 51 Ind_Axis.DEM 38 Name: ind_axis, dtype: int64
Labeling all the indicators was tedious, but there was no way around it. I have classified most of the indicators as indicators of socioeconomic equity.
Most of the tables are scores or percentages (e.g. percentage of wealth shared by the poorest 20%), in the 0-1 and 0-100 ranges, which makes for easy correlation with each other. Other tables are gross populations (e.g. total number of children enrolled in primary school), which also makes for easy correlation with each other. But, correlation between the groups requires controlling for population. So, I need to determine which tables are gross populations and convert them to a percentage of population.
There is a population table spanning from the 1800s to the end of this century. When selecting the tables, I carelessly did not note which tables are gross populations, so I want a quicker way to find out which tables need adjustment without going back and manually looking at each one. To do this, I can find all the tables with max values above 100. There might be some gross tables that don't meet this criteria, so I can isolate those tables with max values below 100 and are not 1 to manually inspect to decide which group they fit into.
Of these tables, some values are actual population number, and others are stated as millions of people. Since I'm only going to be looking at correlations, this might not pose an issue. But, I am concerned that the extremely small values that will result from adjusting for population might push the limits of float type accuracy. So, I will need to identify which gross population tables are not actual population numbers and adjust accordingly.
With that adjustment, all tables should have values relative to their country and year, and thus we can draw correlations.
There are many NaNs and some zeros in a lot of the tables. I chose to leave zeros as zeros when reading the files into dataframes because many of the tables contain both NaNs and zeros. This, coupled with the fact that these tables come from Gapminder where they have already been cleaned up, suggests that each NaN and zero is intentional. I may need to handle missing values later in the analysis.
That said, some tables have no NaNs and some zeros, some with many zeros. That raises some concerns as the many different sources may have used different methods for handling missing values, and Gapminder may not have applied a standard. Some tables may represent missing values with zeros, and others may use NaNs. Or, there may simply be tables with no missing values. I have not found any guidance on this on the Gapminder site. If I needed to be certain, I could contact someone at Gapminder or look at how others have handled this. For now, I will just keep an eye out for anything fishy, like indicators that don't comp well with any other indicators at all.
I'll start by isolating gross population tables to adjust them to proportions of population.
# Get a list of tables that are gross population tables
def get_max_lists(df_dict: dict) -> list:
'''Iterates through dictionary of dataframes.
Returns:
max_over_100: list of keys for tables with a max value over 100.
max_under100_not1: list of keys for tables with a max value under 100
and not equal to 1.'''
# Tables with max values over 100
# These are almost definitely gross values.
max_over_100 = list()
# Tables with max values under 100 and not 1.
# These are likely gross values, but may be a score/percentage table
# without a value at the top of its possible range.
ambiguous_max = list()
for name, table in df_dict.items():
max_val = table.max().max()
if max_val > 100:
max_over_100.append(name)
elif max_val < 100 and max_val != 1:
ambiguous_max.append(name)
return max_over_100, ambiguous_max
max_over_100_lst, ambiguous_max_lst = get_max_lists(df_dict=df_dict)
max_over_100_lst
['children_out_of_school_primary', 'gc_xpn_totl_gd_zs', 'math_achievement_4th_grade', 'math_achievement_8th_grade', 'number_of_people_in_poverty', 'se_pre_enrl_tc_zs', 'se_pre_enrr', 'se_prm_cmpt_zs', 'se_prm_enrl', 'se_prm_enrr', 'se_prm_uner', 'se_sec_cmpt_lo_zs', 'se_sec_enrr', 'se_ter_enrl_tc_zs', 'se_ter_enrr', 'se_xpd_seco_pc_zs', 'se_xpd_tert_pc_zs']
ambiguous_max_lst
['aged_15plus_unemployment_rate_percent', 'aged_15_24_unemployment_rate_percent', 'aged_25_54_unemployment_rate_percent', 'aged_55_64_unemployment_rate_percent', 'aged_65plus_unemployment_rate_percent', 'alternative_poverty_percent_below_nationally_defined_poverty', 'corruption_perception_index_cpi', 'democracy_score_use_as_color', 'demox_eiu', 'dollar_billionaires_per_million_people', 'gc_xpn_comp_zs', 'gini', 'income_share_of_2nd_poorest_20percent', 'income_share_of_2nd_richest_20percent', 'income_share_of_middle_20percent', 'income_share_of_poorest_10percent', 'income_share_of_poorest_20percent', 'income_share_of_richest_10percent', 'income_share_of_richest_20percent', 'inequality_index_gini', 'iq_cpa_pres_xq', 'iq_cpa_soci_xq', 'iq_cpa_tran_xq', 'locdemo_idea', 'long_term_unemployment_rate_percent', 'per_allsp_adq_pop_tot', 'per_allsp_ben_q1_tot', 'per_allsp_cov_pop_tot', 'per_lm_alllm_adq_pop_tot', 'per_lm_alllm_ben_q1_tot', 'per_lm_alllm_cov_pop_tot', 'per_lm_alllm_cov_q1_tot', 'per_lm_alllm_cov_q2_tot', 'per_lm_alllm_cov_q3_tot', 'per_lm_alllm_cov_q4_tot', 'per_lm_alllm_cov_q5_tot', 'per_sa_allsa_adq_pop_tot', 'per_sa_allsa_ben_q1_tot', 'per_sa_allsa_cov_pop_tot', 'per_sa_allsa_cov_q1_tot', 'per_sa_allsa_cov_q2_tot', 'per_sa_allsa_cov_q3_tot', 'per_sa_allsa_cov_q5_tot', 'per_si_allsi_adq_pop_tot', 'per_si_allsi_ben_q1_tot', 'per_si_allsi_cov_pop_tot', 'per_si_allsi_cov_q1_tot', 'per_si_allsi_cov_q2_tot', 'per_si_allsi_cov_q3_tot', 'per_si_allsi_cov_q4_tot', 'per_si_allsi_cov_q5_tot', 'primary_completion_rate_total_percent_of_relevant_age_group', 'rgov_idea', 'se_com_durs', 'se_pre_durs', 'se_prm_ages', 'se_prm_durs', 'se_prm_oenr_zs', 'se_prm_priv_zs', 'se_prm_rept_zs', 'se_prm_uner_zs', 'se_sec_ages', 'se_sec_cuat_up_zs', 'se_sec_durs', 'se_sec_enrl_lo_tc_zs', 'se_sec_enrl_tc_zs', 'se_sec_nenr', 'se_sec_priv_zs', 'se_sec_uner_lo_zs', 'se_ter_cuat_ba_zs', 'se_ter_cuat_do_zs', 'se_ter_cuat_ms_zs', 'se_ter_cuat_st_zs', 'se_xpd_prim_pc_zs', 'se_xpd_totl_gb_zs', 'se_xpd_totl_gd_zs', 'si_dst_02nd_20', 'si_dst_03rd_20', 'si_dst_04th_20', 'si_dst_05th_20', 'si_dst_10th_10', 'si_dst_50md', 'si_dst_frst_10', 'si_dst_frst_20', 'si_pov_gini', 'si_pov_mdim', 'si_pov_mdim_17', 'si_pov_mdim_17_xq', 'si_pov_mdim_hh', 'si_pov_mdim_it', 'si_pov_mdim_xq', 'si_pov_nahc', 'si_spr_pc40', 'si_spr_pcap', 'sl_emp_vuln_zs', 'sl_uem_1524_ne_zs', 'sl_uem_1524_zs', 'sl_uem_advn_zs', 'sl_uem_basc_zs', 'sl_uem_intm_zs', 'sl_uem_neet_zs', 'sl_uem_totl_ne_zs', 'sl_uem_totl_zs']
Looking a few of these up online, I can see right off the bat that many of the ambiguous tables contain percentages. I can remove the names that contain 'per'. I can also see that some of these are range-bound scores. I can remove the names that contain 'index', 'score', and 'gini'.
def remove_names(name_lst: list, sub_lst: list = None,
remove_names_lst: list = None) -> list:
'''Returns a list of strings on name_lst that do not contain any of the
substrings in sub_lst, and without the items in remove_name_lst.
Parameters:
name_lst: list of strings to check for presence of substrings.
sub_lst: (optional) list of substrings to search for in each item of
name_lst.
remove_name_lst: (optional) list of strings to remove from name_lst
Returns:
keep_lst: list of strings from name_lst that do not contain any of the
substrings in sub_lst, and without the items in
remove_name_list. If neither list passed, name_lst returned.'''
keep_lst = list()
# Construct a list of names to remove (adding to given removal list if it
# exists).
if sub_lst is not None:
if remove_names_lst is None:
remove_names_lst = list()
for name in name_lst:
for string in sub_lst:
if string in name:
remove_names_lst.append(name)
break
# If you have anything to remove, construct the keeper list from everything
# on the name list that isn't on the removal list.
if remove_names_lst is not None:
keep_lst = [x for x in name_lst if x not in remove_names_lst]
# Return the input name list if you had nothing to remove.
else:
keep_lst = name_lst
return keep_lst
sub_lst = ['per', 'index', 'score', 'gini']
preculled_amb_lst = remove_names(name_lst=ambiguous_max_lst, sub_lst=sub_lst)
preculled_amb_lst
['demox_eiu', 'gc_xpn_comp_zs', 'iq_cpa_pres_xq', 'iq_cpa_soci_xq', 'iq_cpa_tran_xq', 'locdemo_idea', 'rgov_idea', 'se_com_durs', 'se_pre_durs', 'se_prm_ages', 'se_prm_durs', 'se_prm_oenr_zs', 'se_prm_priv_zs', 'se_prm_rept_zs', 'se_prm_uner_zs', 'se_sec_ages', 'se_sec_cuat_up_zs', 'se_sec_durs', 'se_sec_enrl_lo_tc_zs', 'se_sec_enrl_tc_zs', 'se_sec_nenr', 'se_sec_priv_zs', 'se_sec_uner_lo_zs', 'se_ter_cuat_ba_zs', 'se_ter_cuat_do_zs', 'se_ter_cuat_ms_zs', 'se_ter_cuat_st_zs', 'se_xpd_prim_pc_zs', 'se_xpd_totl_gb_zs', 'se_xpd_totl_gd_zs', 'si_dst_02nd_20', 'si_dst_03rd_20', 'si_dst_04th_20', 'si_dst_05th_20', 'si_dst_10th_10', 'si_dst_50md', 'si_dst_frst_10', 'si_dst_frst_20', 'si_pov_mdim', 'si_pov_mdim_17', 'si_pov_mdim_17_xq', 'si_pov_mdim_hh', 'si_pov_mdim_it', 'si_pov_mdim_xq', 'si_pov_nahc', 'si_spr_pc40', 'si_spr_pcap', 'sl_emp_vuln_zs', 'sl_uem_1524_ne_zs', 'sl_uem_1524_zs', 'sl_uem_advn_zs', 'sl_uem_basc_zs', 'sl_uem_intm_zs', 'sl_uem_neet_zs', 'sl_uem_totl_ne_zs', 'sl_uem_totl_zs']
Okay, this looks like it might be a lot of busy work, but there are really only a few groups to manually check. Looking up these tables online, I see a few other substrings indicating percentages/scores, and some indicating duration in years and ages.
sub_lst = ['xq', 'idea', 'zs', 'dst', 'mdim', 'si', 'durs', 'ages']
remove_names_lst = ['demox_eiu', 'se_sec_nenr']
gross_names_lst = remove_names(name_lst=preculled_amb_lst, sub_lst=sub_lst,
remove_names_lst=remove_names_lst)
gross_names_lst += max_over_100_lst
gross_names_lst
['children_out_of_school_primary', 'gc_xpn_totl_gd_zs', 'math_achievement_4th_grade', 'math_achievement_8th_grade', 'number_of_people_in_poverty', 'se_pre_enrl_tc_zs', 'se_pre_enrr', 'se_prm_cmpt_zs', 'se_prm_enrl', 'se_prm_enrr', 'se_prm_uner', 'se_sec_cmpt_lo_zs', 'se_sec_enrr', 'se_ter_enrl_tc_zs', 'se_ter_enrr', 'se_xpd_seco_pc_zs', 'se_xpd_tert_pc_zs']
Okay, that should be all the tables with gross or net values that are not percentages or scores or otherwise not population related. But, I happened to notice that se_sec_enrr, which I explicitly removed from the ambiguous list, was added in with max_over_100_lst. This table is an enrollment ratio table and shouldn't be adjusted for population. Plus, I recognize a removal substring in some of the names. I'm going to go ahead and manually look up the max_over_100_lst tables. There aren't many left.
gross_names_lst = remove_names(name_lst=gross_names_lst, sub_lst=['zs', 'enrr'],
remove_names_lst=['math_achievement_4th_grade',
'math_achievement_8th_grade'])
gross_names_lst
['children_out_of_school_primary', 'number_of_people_in_poverty', 'se_prm_enrl', 'se_prm_uner']
Now, I need to double check online to see if the values in each table are actual values, or values in millions -- or something else.
And, it looks like only one needs to be multiplied, number_of_people_in_poverty.
# Check a test value before.
print('Test value before:')
test_val_loc = df_dict['number_of_people_in_poverty']['1978'].loc\
[df_dict['number_of_people_in_poverty']['1978'] == \
df_dict['number_of_people_in_poverty']['1978'].max()]
test_val_loc
df_dict['number_of_people_in_poverty'] = \
df_dict['number_of_people_in_poverty'] * 1000000
# Check value after.
print('Test value after:')
test_val_loc = df_dict['number_of_people_in_poverty']['1978'].loc\
[df_dict['number_of_people_in_poverty']['1978'] == \
df_dict['number_of_people_in_poverty']['1978'].max()]
test_val_loc
Test value before:
country India 440.0 Name: 1978, dtype: float64
Test value after:
country India 440000000.0 Name: 1978, dtype: float64
Looks like it worked. Now, I need to adjust for population.
I'll change the keys of the tables I adjust in order to keep track of it. That means I'll need to change the index of indicator axis labels accordingly.
# Check sample value and total count before.
print('Shape of test table before:')
test_shape = df_dict['number_of_people_in_poverty'].shape
print('Test value before:')
test_val_loc = df_dict['number_of_people_in_poverty']['1978'].loc\
[df_dict['number_of_people_in_poverty']['1978'] == \
df_dict['number_of_people_in_poverty']['1978'].max()]
test_val_loc
print('Count of observations in test table:')
test_count = df_dict['number_of_people_in_poverty'].count().sum()
test_count
print('Label of test indicator:')
test_label = labels_sr['number_of_people_in_poverty']
test_label
Shape of test table before: Test value before:
country India 440000000.0 Name: 1978, dtype: float64
Count of observations in test table:
1099
Label of test indicator:
<Ind_Axis.SOCEQ: 2>
# Load population table.
population_total_df = pd.read_csv('population_total.csv',
skip_blank_lines=False, index_col=0)
# Replace tables and labels.
for name in gross_names_lst:
df_dict[name + '_popad'] = df_dict.pop(name) / population_total_df
df_dict[name + '_popad'].columns.name = 'year'
labels_sr[name + '_popad'] = labels_sr.pop(name)
# Check after.
print('Is the new value equal to the old value divided by the corresponding \
population value?')
df_dict['number_of_people_in_poverty_popad']['1978'].loc[test_val_loc.index[0]]\
== test_val_loc[0] / population_total_df['1978'].loc[test_val_loc.index[0]]
print('Same shape?')
df_dict['number_of_people_in_poverty_popad'].shape == test_shape
print('Same total count?')
test_count == df_dict['number_of_people_in_poverty_popad'].count().sum()
print('The right label at the right ndex?')
test_label == labels_sr['number_of_people_in_poverty_popad']
print('Are the old tables and labels still extant?')
pd.Series(df_dict.keys()).isin(gross_names_lst).any()
labels_sr.index.isin([gross_names_lst]).any()
Is the new value equal to the old value divided by the corresponding population value?
True
Same shape?
False
Same total count?
True
The right label at the right ndex?
True
Are the old tables and labels still extant?
False
False
Because the population table had a different shape than the people in poverty table, dividing by the population table reshaped the poverty table, adding a lot of years and countries, which brought in a lot of NaNs.
But, the operation appears to have worked on a sample column, and our sample table has the same number of observations as before.
Let's trim those extra rows and columns with all NaNs. In fact, let's do this for all the tables. This might save compute later.
for df_name, df in df_dict.items():
df_dict[df_name] = df.dropna(how='all', axis=0)
df_dict[df_name] = df.dropna(how='all', axis=1)
print('Shape and count:')
df_dict['number_of_people_in_poverty_popad'].shape
df_dict['number_of_people_in_poverty_popad'].count().sum()
Shape and count:
(195, 36)
1099
# whos
del df_name
del get_max_lists
del gross_names_lst
del max_over_100_lst
del name
del preculled_amb_lst
del remove_names
del remove_names_lst
del sub_lst
del test_count
del test_label
del test_shape
del test_val_loc
Okay, let's dig in!
Cufflinks is wrapper of Plotly.plot enabling you to use it on Pandas Dataframes and Series'.
Examples: https://github.com/santosjorge/cufflinks/blob/master/Cufflinks%20Tutorial%20-%20Pandas%20Like.ipynb
Documentation: https://github.com/santosjorge/cufflinks https://github.com/santosjorge/cufflinks/blob/master/Cufflinks%20Tutorial%20-%20Plotly.ipynb
I want to test it out on the EIU democracy indicator. The iplot function alone has a long list of optional parameters, but playing with the kind parameter, I found a neat interactive 3d surface map that gives you a sense of the varability of the data overall as well as a way to peak at each country over the years. Hover over the upper-right corner to change the way you interact with the map.
get_better_desc(df_dict['demox_eiu'])[0]
| year | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 |
| mean | 0.550591 | 0.552229 | 0.553762 | 0.549543 | 0.545232 | 0.548055 | 0.550134 | 0.551378 | 0.552799 | 0.553134 | 0.549823 | 0.545384 | 0.546061 |
| std | 0.227059 | 0.226048 | 0.225813 | 0.224348 | 0.224033 | 0.219270 | 0.220242 | 0.220460 | 0.218922 | 0.220598 | 0.220813 | 0.220017 | 0.221183 |
| min | 0.103000 | 0.094500 | 0.086000 | 0.097000 | 0.108000 | 0.108000 | 0.108000 | 0.108000 | 0.108000 | 0.108000 | 0.108000 | 0.108000 | 0.108000 |
| 25% | 0.341000 | 0.347500 | 0.345750 | 0.347000 | 0.341000 | 0.352000 | 0.355000 | 0.352000 | 0.352750 | 0.365000 | 0.355500 | 0.361000 | 0.346500 |
| 50% | 0.585500 | 0.588500 | 0.585000 | 0.580500 | 0.585000 | 0.584000 | 0.585000 | 0.584000 | 0.577500 | 0.574000 | 0.571000 | 0.566000 | 0.566000 |
| 75% | 0.737250 | 0.737250 | 0.735250 | 0.726250 | 0.721250 | 0.717500 | 0.718000 | 0.725250 | 0.739250 | 0.731000 | 0.725250 | 0.720000 | 0.716750 |
| max | 0.988000 | 0.988000 | 0.988000 | 0.974000 | 0.980000 | 0.980000 | 0.993000 | 0.993000 | 0.993000 | 0.993000 | 0.993000 | 0.987000 | 0.987000 |
| kurt | -1.110488 | -1.102634 | -1.090336 | -1.104593 | -1.116748 | -1.049147 | -1.015978 | -1.020799 | -1.050014 | -1.056156 | -1.054437 | -1.023767 | -1.040529 |
| skew | -0.058706 | -0.052904 | -0.055869 | -0.043474 | -0.041246 | -0.048756 | -0.055812 | -0.078980 | -0.059019 | -0.101974 | -0.077686 | -0.054467 | -0.054551 |
# help(df_dict['demox_eiu'].iplot)
df_dict['demox_eiu'].iplot(kind='surface', title='demox_eiu', xTitle='Country',
yTitle='Year', zTitle='Democracy Score')
Something that bothers me about iplot (other than the fact that it's axis titles parameters don't work) is that I didn't find a way to easily plot along the columns (e.g. axis=1). I used a the surface map above to get around this, but I'd like to see a table as line charts for each row, that is, for each country on this table. I have to transpose the dataframe then plot it.
It gave me a very busy line chart since there are 164 countries, but something jumps out at me: countries with lower democracy scores tend to have more unstable scores over the years than those with higher scores. Another way to say that is that those countries with stonger democracy scores tend to stay that way.
There's not enough information here to determine why that is the case. We can't say whether democratic conditions create the conditions for democracratic conditions to continue, or whether there's some other underlying factor. It just simply appears to be the case that strong democracies tend to stay strong democracies, in the timespan we're observing.
df_dict['demox_eiu'].transpose().iplot()
We can see the same feature a little more dramatically by comparing scatter plots of the years compared to other years.
It would be tedious to chart all 196 comparisons of the 14 years, so let's look at the earliest year compared to the latest year and some years in between.
You can watch the correlation weaken as time passes. In other words, without drawing any true predictive conclusions, it appears that the strength of a country's democracy in one year might predict the strength of its democracy in following years, but less so the further forward you look.
Notice the way the later scatter plots seem to loosen up more around their bottoms than their tops. Again, without drawing any conclusions about causation, this suggests that countries with weaker democracies may be more likely to improve or degrade their democratic features. Or, full-fledged democracies may be more likely to stay that way.
Also, notice how iplot draws scatter plots with lines instead of dots. I used Plotly Express instead.
# # Creates a scatterplot with lines instead of dots. Why?
# help(df_dict['demox_eiu'].iplot)
# df_dict['demox_eiu'].iplot(kind='scatter', x='2006', y='2007',
# symbol='circle-dot', theme='white')
px.scatter(df_dict['demox_eiu'], x='2006', y='2007',
title='EIU Democracy Score')
print('r =', df_dict['demox_eiu']['2006'].corr(df_dict['demox_eiu']['2007'],
method='pearson'))
px.scatter(df_dict['demox_eiu'], x='2006', y='2012')
print('r =', df_dict['demox_eiu']['2006'].corr(df_dict['demox_eiu']['2012'],
method='pearson'))
px.scatter(df_dict['demox_eiu'], x='2006', y='2018')
print('r =', df_dict['demox_eiu']['2006'].corr(df_dict['demox_eiu']['2018'],
method='pearson'))
r = 0.9983068076521273
r = 0.9591564488398729
r = 0.9407853793757875
This might be worth revisiting in a more rigorous analysis beyond the scope of this exploratory project. You might start by dividing the countries into those with a mean score above the table mean score and those with a mean score below, then compare the variance of the two groups, and do the same for other measures of democracy. Again, this is beyond the scope of this project.
Let's move onto the main research question regarding the larger set of tables.
To get at this question, I need to be able to compare entire tables to each other, to correlate point by point at a scalar level. I found no direct way to correlate an entire 2d dataframe with another, only column to column or row to row.
So, to compare two indicators (i.e. two dataframes loaded from the csv files), I need to convert each indicator into a single column or series. Furthermore, each column needs to have the same shape in order to satisfy correlation function requirements. They need to share an index with the rest, or at least string the values in the same order, to compare the same (country, year) fields. Also, it would be best if they were all on the same table to make for easier operations.
Below, I do that. The resulting dataframe is such that each column is the values of a single indicator, and the index is a Pandas MultiIndex of years and countries.
# Make a copy to avoid in-place modification.
df_dict_copy = df_dict.copy()
# Stack each table into a single, multi-indexed series.
for key, df in df_dict_copy.items():
df_dict_copy[key] = df.stack()
# Add stacked indicators as columns in a new dataframe.
stacked_indicators_df = pd.concat(df_dict_copy.values(), keys=df_dict_copy.keys(),
names=['indicator'], axis=1)
del df_dict_copy
stacked_indicators_df
| indicator | abscorrup_idea | aged_15plus_unemployment_rate_percent | aged_15_24_unemployment_rate_percent | aged_25_54_unemployment_rate_percent | aged_55_64_unemployment_rate_percent | aged_65plus_unemployment_rate_percent | ajustice_idea | alternative_poverty_percent_below_nationally_defined_poverty | bwelfr_idea | celec_idea | ... | sl_uem_basc_zs | sl_uem_intm_zs | sl_uem_neet_zs | sl_uem_totl_ne_zs | sl_uem_totl_zs | sorigeq_idea | children_out_of_school_primary_popad | number_of_people_in_poverty_popad | se_prm_enrl_popad | se_prm_uner_popad | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| country | year | |||||||||||||||||||||
| Afghanistan | 1800 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1801 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| 1802 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| 1803 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| 1804 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Zimbabwe | 2036 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2037 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| 2038 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| 2039 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| 2040 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
46995 rows × 164 columns
get_better_desc(stacked_indicators_df)[0]
get_better_desc(stacked_indicators_df)[1]
| indicator | abscorrup_idea | aged_15plus_unemployment_rate_percent | aged_15_24_unemployment_rate_percent | aged_25_54_unemployment_rate_percent | aged_55_64_unemployment_rate_percent | aged_65plus_unemployment_rate_percent | ajustice_idea | alternative_poverty_percent_below_nationally_defined_poverty | bwelfr_idea | celec_idea | ... | sl_uem_basc_zs | sl_uem_intm_zs | sl_uem_neet_zs | sl_uem_totl_ne_zs | sl_uem_totl_zs | sorigeq_idea | children_out_of_school_primary_popad | number_of_people_in_poverty_popad | se_prm_enrl_popad | se_prm_uner_popad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6417.000000 | 3028.000000 | 2102.000000 | 2089.000000 | 2035.000000 | 1542.000000 | 6387.000000 | 777.000000 | 6417.000000 | 6393.000000 | ... | 1667.000000 | 1645.000000 | 1379.000000 | 4016.000000 | 5191.000000 | 6417.000000 | 4170.000000 | 1099.000000 | 7379.000000 | 4170.000000 |
| mean | 0.458891 | 0.083525 | 0.177745 | 0.069049 | 0.049717 | 0.026295 | 0.564745 | 0.287472 | 0.506817 | 0.492970 | ... | 11.229924 | 9.391588 | 17.352191 | 8.077611 | 8.021135 | 0.434915 | 0.020146 | 0.142330 | 0.118379 | 0.020146 |
| std | 0.204968 | 0.057907 | 0.111125 | 0.050627 | 0.039840 | 0.028378 | 0.194119 | 0.163596 | 0.204034 | 0.293927 | ... | 8.405251 | 6.521370 | 9.028356 | 5.882734 | 6.253656 | 0.189795 | 0.029067 | 0.207472 | 0.051212 | 0.029067 |
| min | 0.000000 | 0.001400 | 0.004800 | 0.000900 | 0.000200 | 0.000000 | 0.000000 | 0.004000 | 0.000000 | 0.000000 | ... | 0.120000 | 0.179000 | 0.062600 | 0.050000 | 0.091000 | 0.000000 | 0.000000 | 0.000000 | 0.004646 | 0.000000 |
| 25% | 0.307000 | 0.043500 | 0.096700 | 0.035800 | 0.023150 | 0.009810 | 0.418500 | 0.167000 | 0.344000 | 0.270000 | ... | 5.415000 | 5.020000 | 10.600000 | 4.000000 | 3.600000 | 0.289000 | 0.001206 | 0.004463 | 0.077425 | 0.001206 |
| 50% | 0.434000 | 0.070000 | 0.154000 | 0.057100 | 0.040500 | 0.019000 | 0.549000 | 0.256000 | 0.503000 | 0.527000 | ... | 9.240000 | 7.750000 | 16.600000 | 6.705000 | 6.260000 | 0.419000 | 0.006190 | 0.044243 | 0.114286 | 0.006190 |
| 75% | 0.582000 | 0.107000 | 0.231000 | 0.085600 | 0.064550 | 0.034000 | 0.693000 | 0.384000 | 0.660000 | 0.763000 | ... | 14.400000 | 11.400000 | 22.050000 | 10.500000 | 10.700000 | 0.563000 | 0.025592 | 0.174339 | 0.155385 | 0.025592 |
| max | 1.000000 | 0.379000 | 0.631000 | 0.333000 | 0.326000 | 0.430000 | 1.000000 | 0.833000 | 1.000000 | 1.000000 | ... | 53.200000 | 95.700000 | 68.600000 | 38.800000 | 38.000000 | 1.000000 | 0.146735 | 0.949541 | 0.386838 | 0.146735 |
| kurt | -0.269317 | 3.247544 | 1.778632 | 4.174742 | 6.780309 | 43.888387 | -0.568659 | 0.020714 | -0.878488 | -1.127393 | ... | 2.859930 | 19.970482 | 1.449746 | 3.047683 | 3.112874 | -0.324316 | 2.904096 | 2.401206 | 0.158820 | 2.904096 |
| skew | 0.535013 | 1.596247 | 1.262493 | 1.838275 | 2.005997 | 4.772936 | 0.227055 | 0.734354 | 0.100551 | -0.281550 | ... | 1.538761 | 2.636335 | 0.901226 | 1.538669 | 1.597697 | 0.412401 | 1.870205 | 1.799299 | 0.488835 | 1.870205 |
10 rows × 164 columns
| count | mean | std | min | 25% | 50% | 75% | max | kurt | skew | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 164.000000 | 163.000000 | 163.000000 |
| mean | 3231.347561 | 24.141491 | 10.126765 | 5.168516 | 17.513715 | 23.380406 | 29.994133 | 68.925780 | 3.489540 | 0.607324 |
| std | 4391.211651 | 55.494546 | 16.297609 | 27.938552 | 51.304279 | 57.523078 | 61.754274 | 205.307835 | 15.212752 | 1.590509 |
| min | 2.000000 | -1.066326 | 0.019159 | -10.000000 | -7.000000 | -3.000000 | 0.000000 | 0.140000 | -1.545688 | -1.933794 |
| 25% | 558.750000 | 0.564294 | 0.259400 | 0.000000 | 0.390250 | 0.600250 | 0.784750 | 1.000000 | -0.659921 | -0.232535 |
| 50% | 1774.000000 | 10.736559 | 6.991085 | 0.047700 | 4.472500 | 7.440000 | 12.900000 | 39.350000 | 0.065903 | 0.410138 |
| 75% | 5191.000000 | 26.903915 | 15.929134 | 1.125000 | 15.800000 | 25.300000 | 35.381250 | 90.575000 | 2.642568 | 1.236642 |
| max | 46995.000000 | 481.811594 | 157.095252 | 276.000000 | 462.000000 | 503.000000 | 531.000000 | 2490.000000 | 143.719604 | 11.930122 |
| kurt | 60.516111 | 53.852726 | 42.230157 | 77.119836 | 58.823934 | 53.015691 | 48.090993 | 120.627231 | 57.946951 | 19.055118 |
| skew | 6.378822 | 6.839969 | 5.352904 | 8.651073 | 7.280741 | 6.784226 | 6.324341 | 10.397686 | 7.221247 | 3.165955 |
It looks like there's an indicator with only 2 values (see the super describe table [min, count]). I want to remove that indicator, and possibly others with few values. I'll set a relatively low minimum value count of 30 and remove all tables with fewer observations than that. I might find later that I need to raise the bar if I run into problems.
And, I'll strip any rows that ended up with all NaNs after removing columns. I'll also need to remove from the labels series any indicators I removed from the dataframe.
# Remove columns that don't have at least 20 non-NaN values
stacked_indicators_df.dropna(axis=1, how='any', thresh=(30), inplace=True)
# Remove rows that have all NaN.
stacked_indicators_df.dropna(axis=0, how='all', inplace=True)
# Remove labels of missing indicators.
drop_ar = np.setdiff1d(ar1=labels_sr.index.values,
ar2=stacked_indicators_df.columns.values,
assume_unique=True)
labels_sr.drop(labels=drop_ar, inplace=True)
I also see that my maximum value is 2,490. I wasn't anticipating values this high, but it's not necessarily a bug. Although, I do want to look more closely at indicators with high max values, say over 300.
high_max_lst = [col_name for col_name, col in stacked_indicators_df.items()
if col.max() > 300]
high_max_lst
['math_achievement_4th_grade', 'math_achievement_8th_grade', 'se_xpd_tert_pc_zs']
We've got a couple of tables of test scores that I recall from looking at them are in the hundreds, so those are fine.
The other table is the table with the highest value, and it's government expenditure per tertiary student, expressed as a percentage of GDP per capita.
2,490% seems like a lot. You can see below that it's many IQRs above the third quartile. I should remove outliers from this and other indicators.
First, I want to see which country this is!
# Find it in the indicators dataframe.
stacked_indicators_df['se_xpd_tert_pc_zs']\
[stacked_indicators_df['se_xpd_tert_pc_zs'].isin([2490])]
country year
Malawi 1999 2490.0
2000 2490.0
Name: se_xpd_tert_pc_zs, dtype: float64
# Double check in the original table.
mask = df_dict['se_xpd_tert_pc_zs'].isin([2490])
df_dict['se_xpd_tert_pc_zs'].loc[mask.any(axis=1), mask.any(axis=0)]
| year | 1999 | 2000 |
|---|---|---|
| country | ||
| Malawi | 2490.0 | 2490.0 |
Yes, let's remove outliers from this indicator and any others that need it. I will arbitrarily remove values with z-scores greater than 3. For another project, I might need to develop a more rigorous algorithm for setting the z-score limit for each indicator. For instance, I might use the IQR. But, I use quantile() elsewhere in this analysis, so I want to try out zscore() here.
def remove_outliers(s: pd.Series, high_z: float = 3): # No return value.
'''Set outliers to None.
Parameters:
s: pandas.Series to remove outliers from.
high_z: float z-score to cut off.
Returns:
s: pandas.Series argument outliers set to None.'''
# Get z-scores.
zscores = np.abs(stats.zscore(s, nan_policy='omit'))
# Get indices of outliers.
outlier_indices = np.where(zscores > high_z)
# Set these values to None.
s.iloc[outlier_indices[0]] = None
return s
# Check the column before.
stacked_indicators_df['se_xpd_tert_pc_zs'].describe()
count 1404.000000 mean 67.576117 std 157.095252 min 0.000000 25% 20.875000 50% 30.400000 75% 49.325000 max 2490.000000 Name: se_xpd_tert_pc_zs, dtype: float64
# Set outlier values to None.
stacked_indicators_df = stacked_indicators_df.apply(
lambda col: remove_outliers(s=col))
# Check column after.
stacked_indicators_df['se_xpd_tert_pc_zs'].describe()
count 1381.000000 mean 51.509680 std 64.355276 min 0.000000 25% 20.800000 50% 30.000000 75% 47.100000 max 476.000000 Name: se_xpd_tert_pc_zs, dtype: float64
Seems to have done the trick.
# Clean up.
stacked_indicators_df.dropna(axis=0, how='all', inplace=True)
# whos
del ambiguous_max_lst
del drop_ar
del high_max_lst
del key
del mask
I'd love to throw this table of indicators at pandas_profiling, but it proved too much for my 32GB of RAM. We'll hold off for a smaller selection of indicators.
We're ready to find the correlations between all these indicators and get a bird's eye view of the relationships with a heatmap of the r values. I'll arbitrarily set the minimum number of observations needed for correlation to 20.
# Each indicator correlated with the rest, three different ways.
cors_dfs_dict = {'Pears': stacked_indicators_df.corr(method='pearson',
min_periods=20)}#,
# 'Kendall': stacked_indicators_df.corr(method='kendall'),
# 'Spears': stacked_indicators_df.corr(method='spearman')}
cors_dfs_dict['Pears']
| indicator | abscorrup_idea | aged_15plus_unemployment_rate_percent | aged_15_24_unemployment_rate_percent | aged_25_54_unemployment_rate_percent | aged_55_64_unemployment_rate_percent | aged_65plus_unemployment_rate_percent | ajustice_idea | alternative_poverty_percent_below_nationally_defined_poverty | bwelfr_idea | celec_idea | ... | sl_uem_basc_zs | sl_uem_intm_zs | sl_uem_neet_zs | sl_uem_totl_ne_zs | sl_uem_totl_zs | sorigeq_idea | children_out_of_school_primary_popad | number_of_people_in_poverty_popad | se_prm_enrl_popad | se_prm_uner_popad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| indicator | |||||||||||||||||||||
| abscorrup_idea | 1.000000 | 0.052642 | -0.057722 | 0.071821 | 0.249355 | -0.048495 | 0.762772 | -0.309892 | 0.609933 | 0.624004 | ... | 0.205875 | -0.191510 | -0.510108 | 0.038034 | 0.063780 | 0.690532 | -0.343882 | -0.406755 | -0.319884 | -0.343882 |
| aged_15plus_unemployment_rate_percent | 0.052642 | 1.000000 | 0.906335 | 0.978170 | 0.769491 | 0.407975 | 0.118145 | 0.042385 | 0.023975 | 0.157044 | ... | 0.778825 | 0.876531 | 0.330379 | 0.996947 | 0.943929 | 0.093590 | 0.007963 | -0.257109 | 0.020003 | 0.007963 |
| aged_15_24_unemployment_rate_percent | -0.057722 | 0.906335 | 1.000000 | 0.864624 | 0.607262 | 0.308410 | 0.049414 | -0.123732 | 0.060773 | 0.100327 | ... | 0.690561 | 0.820983 | 0.320073 | 0.906640 | 0.893731 | 0.066823 | -0.067243 | -0.235149 | -0.110681 | -0.067243 |
| aged_25_54_unemployment_rate_percent | 0.071821 | 0.978170 | 0.864624 | 1.000000 | 0.796206 | 0.369908 | 0.155892 | -0.040877 | 0.105444 | 0.192879 | ... | 0.785033 | 0.843226 | 0.237051 | 0.977334 | 0.952349 | 0.161082 | -0.059301 | -0.288677 | -0.153682 | -0.059301 |
| aged_55_64_unemployment_rate_percent | 0.249355 | 0.769491 | 0.607262 | 0.796206 | 1.000000 | 0.442801 | 0.309919 | -0.042407 | 0.268503 | 0.316939 | ... | 0.796566 | 0.641088 | 0.091972 | 0.772180 | 0.736034 | 0.307411 | -0.091946 | -0.270964 | -0.270325 | -0.091946 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| sorigeq_idea | 0.690532 | 0.093590 | 0.066823 | 0.161082 | 0.307411 | -0.018910 | 0.806924 | -0.415091 | 0.765622 | 0.748948 | ... | 0.314634 | -0.142862 | -0.578720 | 0.061656 | 0.078379 | 1.000000 | -0.473558 | -0.480821 | -0.374846 | -0.473558 |
| children_out_of_school_primary_popad | -0.343882 | 0.007963 | -0.067243 | -0.059301 | -0.091946 | 0.057035 | -0.374754 | 0.437834 | -0.735715 | -0.400359 | ... | -0.135445 | 0.103061 | 0.355269 | -0.026377 | -0.124469 | -0.473558 | 1.000000 | 0.653658 | 0.218573 | 1.000000 |
| number_of_people_in_poverty_popad | -0.406755 | -0.257109 | -0.235149 | -0.288677 | -0.270964 | 0.002428 | -0.424115 | 0.585220 | -0.775079 | -0.440336 | ... | -0.316444 | 0.053275 | 0.227221 | -0.188332 | -0.258366 | -0.480821 | 0.653658 | 1.000000 | 0.435128 | 0.653658 |
| se_prm_enrl_popad | -0.319884 | 0.020003 | -0.110681 | -0.153682 | -0.270325 | 0.085859 | -0.321063 | 0.576810 | -0.432646 | -0.200407 | ... | -0.348765 | 0.007680 | 0.320319 | 0.036665 | -0.049184 | -0.374846 | 0.218573 | 0.435128 | 1.000000 | 0.218573 |
| se_prm_uner_popad | -0.343882 | 0.007963 | -0.067243 | -0.059301 | -0.091946 | 0.057035 | -0.374754 | 0.437834 | -0.735715 | -0.400359 | ... | -0.135445 | 0.103061 | 0.355269 | -0.026377 | -0.124469 | -0.473558 | 1.000000 | 0.653658 | 0.218573 | 1.000000 |
163 rows × 163 columns
cors_better_desc, cors_super_desc = get_better_desc(cors_dfs_dict['Pears'])
cors_better_desc
cors_super_desc
| indicator | abscorrup_idea | aged_15plus_unemployment_rate_percent | aged_15_24_unemployment_rate_percent | aged_25_54_unemployment_rate_percent | aged_55_64_unemployment_rate_percent | aged_65plus_unemployment_rate_percent | ajustice_idea | alternative_poverty_percent_below_nationally_defined_poverty | bwelfr_idea | celec_idea | ... | sl_uem_basc_zs | sl_uem_intm_zs | sl_uem_neet_zs | sl_uem_totl_ne_zs | sl_uem_totl_zs | sorigeq_idea | children_out_of_school_primary_popad | number_of_people_in_poverty_popad | se_prm_enrl_popad | se_prm_uner_popad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 161.000000 | 163.000000 | 163.000000 | ... | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 157.000000 | 163.000000 | 163.000000 |
| mean | 0.225594 | 0.116278 | 0.107243 | 0.121511 | 0.143996 | 0.039820 | 0.260172 | -0.089242 | 0.256294 | 0.242321 | ... | 0.152439 | 0.040430 | -0.097741 | 0.113219 | 0.117646 | 0.278980 | -0.177263 | -0.198252 | -0.130335 | -0.177263 |
| std | 0.383484 | 0.277104 | 0.288123 | 0.290085 | 0.276077 | 0.164958 | 0.425897 | 0.365204 | 0.489539 | 0.393141 | ... | 0.296384 | 0.280880 | 0.338130 | 0.275329 | 0.271997 | 0.436829 | 0.396191 | 0.370504 | 0.382592 | 0.396191 |
| min | -0.724160 | -0.640041 | -0.704953 | -0.685976 | -0.538857 | -0.473975 | -0.814804 | -0.698025 | -0.829881 | -0.766806 | ... | -0.862658 | -0.733794 | -0.690291 | -0.689183 | -0.551393 | -0.768125 | -0.966745 | -0.775079 | -0.753752 | -0.966745 |
| 25% | -0.029511 | -0.021686 | -0.070998 | -0.056362 | -0.051931 | -0.059621 | 0.032908 | -0.352715 | 0.016985 | 0.003603 | ... | -0.072519 | -0.103499 | -0.364026 | -0.020603 | -0.045970 | 0.031033 | -0.413887 | -0.429215 | -0.370212 | -0.413887 |
| 50% | 0.288177 | 0.068696 | 0.062797 | 0.095404 | 0.153710 | 0.010282 | 0.309919 | -0.158477 | 0.430906 | 0.237509 | ... | 0.184731 | -0.021459 | -0.161850 | 0.070653 | 0.085859 | 0.358000 | -0.267175 | -0.292249 | -0.187722 | -0.267175 |
| 75% | 0.512477 | 0.161228 | 0.164771 | 0.194550 | 0.297108 | 0.101569 | 0.570666 | 0.078421 | 0.633421 | 0.546739 | ... | 0.315100 | 0.107232 | 0.143105 | 0.144982 | 0.203952 | 0.620777 | 0.006466 | -0.057399 | 0.039917 | 0.006466 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| kurt | -0.579293 | 2.734700 | 1.858768 | 1.835449 | 0.231577 | 6.922361 | -0.629701 | 0.055475 | -0.549944 | -0.691604 | ... | 0.608741 | 2.565419 | -0.446071 | 3.021642 | 2.534835 | -0.644240 | 0.610651 | 0.444022 | -0.200062 | 0.610651 |
| skew | -0.440100 | 1.267509 | 1.165766 | 0.974008 | 0.195533 | 1.498850 | -0.479898 | 0.834017 | -0.833608 | -0.097289 | ... | -0.114539 | 1.296769 | 0.518357 | 1.279202 | 1.323017 | -0.679814 | 0.755932 | 1.032787 | 0.640399 | 0.755932 |
10 rows × 163 columns
| count | mean | std | min | 25% | 50% | 75% | max | kurt | skew | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 1.630000e+02 | 163.000000 | 163.000000 |
| mean | 160.288344 | 0.093882 | 0.346563 | -0.702549 | -0.118394 | 0.099161 | 0.310692 | 1.000000e+00 | 0.693229 | 0.173053 |
| std | 6.932411 | 0.133566 | 0.076622 | 0.180730 | 0.158183 | 0.181529 | 0.184938 | 1.926915e-16 | 2.045977 | 0.684378 |
| min | 111.000000 | -0.222975 | 0.147224 | -0.994626 | -0.635476 | -0.340235 | -0.062205 | 1.000000e+00 | -1.197010 | -0.875723 |
| 25% | 161.000000 | 0.030392 | 0.281760 | -0.831870 | -0.159158 | 0.005355 | 0.149481 | 1.000000e+00 | -0.569418 | -0.412426 |
| 50% | 163.000000 | 0.126744 | 0.371853 | -0.716205 | -0.051931 | 0.135926 | 0.322782 | 1.000000e+00 | 0.055475 | 0.053832 |
| 75% | 163.000000 | 0.194427 | 0.402813 | -0.608601 | -0.011140 | 0.227931 | 0.466118 | 1.000000e+00 | 0.967202 | 0.739001 |
| max | 163.000000 | 0.278980 | 0.496540 | -0.247357 | 0.066736 | 0.430906 | 0.633421 | 1.000000e+00 | 10.870648 | 1.816703 |
| kurt | 26.256311 | -0.465996 | -0.431246 | -0.428835 | 0.445130 | -0.277882 | -1.179692 | 0.000000e+00 | 8.512707 | -0.721363 |
| skew | -4.635171 | -0.795462 | -0.641364 | 0.333457 | -1.278493 | -0.686970 | -0.168429 | 0.000000e+00 | 2.686707 | 0.463253 |
cors_dfs_dict['Pears'].iplot(kind='heatmap',
title='Correlating indicators, Pearson\'s r')
Here's another busy plot. There are some big-picture features to notice that might direct our inquiry further, though.
There's the expected diagonal where r=1 because it's self-correlation. And, there a are a handful of NaN points where the correlated indicators did not have enough intersecting data points to create a significant r value.
The heatmap is also noticably criss-crossed with horizontal/vertical lines (representing the correlations of a single indicator) of low-to-moderate correlation. The indicators least correlated to most of the rest of the indicators seem to be grouped together with others from the same sources representing similar domains, such as data indicators from the World Bank about the duration and ages in educational levels (e.g. se_prm_durs, primary education duration in years).
Pockets of high correlation (positive or negative) jump out, too. These are typically similar or logically related indicators. For instance, multiple orgs (IDEA, EIU, etc.) have produced baskets of indicators of different aspects of democracy (e.g. political participation, free and open elections, etc.). These are all highly correlated, even across sources. This confirms that Democracy is in fact a coherent, cohesive construct. Indicators of distribution of wealth are also highly correlated to each other, and must be, by definition. Though these high correlations are mostly unsurprising, looking closer at the nature of these relationships (e.g. linear or not?) may prove interesting for another study.
Looking at this heatmap, I don't see a lot of evidence for my initial belief that democracy, socioeconomic equity, and education are interdependently correlated. In fact, I see a lot of evidence to the contrary. Multiple educational indicators are not correlated to much of anything else, while multiple measures of income distribution are not correlated to democratic indicators. Seeing this, I definitely wouldn't make blanket statements about the connections between such broad constructs like democracy, equity, and education.
I do want to look more closely as a point of open interest (not to seek confirmation for my initial bias). Let's look at these axises of indicators (i.e. democracy, socioeconomic equity, and education). Since I'm more interested in the existence of correlation than the direction of correlation, I'll take the absolute value of the r-scores to make them easier to work with. If I want to know the direction of a particular correlation, I can still look it up.
abs_rvals = cors_dfs_dict['Pears'].apply(lambda x: np.abs(x))
abs_rvals
# absr_better_desc, absr_super_desc = get_better_desc(abs_rvals)
# absr_better_desc
# absr_super_desc
get_better_desc(abs_rvals)[0]
get_better_desc(abs_rvals)[1]
| indicator | abscorrup_idea | aged_15plus_unemployment_rate_percent | aged_15_24_unemployment_rate_percent | aged_25_54_unemployment_rate_percent | aged_55_64_unemployment_rate_percent | aged_65plus_unemployment_rate_percent | ajustice_idea | alternative_poverty_percent_below_nationally_defined_poverty | bwelfr_idea | celec_idea | ... | sl_uem_basc_zs | sl_uem_intm_zs | sl_uem_neet_zs | sl_uem_totl_ne_zs | sl_uem_totl_zs | sorigeq_idea | children_out_of_school_primary_popad | number_of_people_in_poverty_popad | se_prm_enrl_popad | se_prm_uner_popad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| indicator | |||||||||||||||||||||
| abscorrup_idea | 1.000000 | 0.052642 | 0.057722 | 0.071821 | 0.249355 | 0.048495 | 0.762772 | 0.309892 | 0.609933 | 0.624004 | ... | 0.205875 | 0.191510 | 0.510108 | 0.038034 | 0.063780 | 0.690532 | 0.343882 | 0.406755 | 0.319884 | 0.343882 |
| aged_15plus_unemployment_rate_percent | 0.052642 | 1.000000 | 0.906335 | 0.978170 | 0.769491 | 0.407975 | 0.118145 | 0.042385 | 0.023975 | 0.157044 | ... | 0.778825 | 0.876531 | 0.330379 | 0.996947 | 0.943929 | 0.093590 | 0.007963 | 0.257109 | 0.020003 | 0.007963 |
| aged_15_24_unemployment_rate_percent | 0.057722 | 0.906335 | 1.000000 | 0.864624 | 0.607262 | 0.308410 | 0.049414 | 0.123732 | 0.060773 | 0.100327 | ... | 0.690561 | 0.820983 | 0.320073 | 0.906640 | 0.893731 | 0.066823 | 0.067243 | 0.235149 | 0.110681 | 0.067243 |
| aged_25_54_unemployment_rate_percent | 0.071821 | 0.978170 | 0.864624 | 1.000000 | 0.796206 | 0.369908 | 0.155892 | 0.040877 | 0.105444 | 0.192879 | ... | 0.785033 | 0.843226 | 0.237051 | 0.977334 | 0.952349 | 0.161082 | 0.059301 | 0.288677 | 0.153682 | 0.059301 |
| aged_55_64_unemployment_rate_percent | 0.249355 | 0.769491 | 0.607262 | 0.796206 | 1.000000 | 0.442801 | 0.309919 | 0.042407 | 0.268503 | 0.316939 | ... | 0.796566 | 0.641088 | 0.091972 | 0.772180 | 0.736034 | 0.307411 | 0.091946 | 0.270964 | 0.270325 | 0.091946 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| sorigeq_idea | 0.690532 | 0.093590 | 0.066823 | 0.161082 | 0.307411 | 0.018910 | 0.806924 | 0.415091 | 0.765622 | 0.748948 | ... | 0.314634 | 0.142862 | 0.578720 | 0.061656 | 0.078379 | 1.000000 | 0.473558 | 0.480821 | 0.374846 | 0.473558 |
| children_out_of_school_primary_popad | 0.343882 | 0.007963 | 0.067243 | 0.059301 | 0.091946 | 0.057035 | 0.374754 | 0.437834 | 0.735715 | 0.400359 | ... | 0.135445 | 0.103061 | 0.355269 | 0.026377 | 0.124469 | 0.473558 | 1.000000 | 0.653658 | 0.218573 | 1.000000 |
| number_of_people_in_poverty_popad | 0.406755 | 0.257109 | 0.235149 | 0.288677 | 0.270964 | 0.002428 | 0.424115 | 0.585220 | 0.775079 | 0.440336 | ... | 0.316444 | 0.053275 | 0.227221 | 0.188332 | 0.258366 | 0.480821 | 0.653658 | 1.000000 | 0.435128 | 0.653658 |
| se_prm_enrl_popad | 0.319884 | 0.020003 | 0.110681 | 0.153682 | 0.270325 | 0.085859 | 0.321063 | 0.576810 | 0.432646 | 0.200407 | ... | 0.348765 | 0.007680 | 0.320319 | 0.036665 | 0.049184 | 0.374846 | 0.218573 | 0.435128 | 1.000000 | 0.218573 |
| se_prm_uner_popad | 0.343882 | 0.007963 | 0.067243 | 0.059301 | 0.091946 | 0.057035 | 0.374754 | 0.437834 | 0.735715 | 0.400359 | ... | 0.135445 | 0.103061 | 0.355269 | 0.026377 | 0.124469 | 0.473558 | 1.000000 | 0.653658 | 0.218573 | 1.000000 |
163 rows × 163 columns
| indicator | abscorrup_idea | aged_15plus_unemployment_rate_percent | aged_15_24_unemployment_rate_percent | aged_25_54_unemployment_rate_percent | aged_55_64_unemployment_rate_percent | aged_65plus_unemployment_rate_percent | ajustice_idea | alternative_poverty_percent_below_nationally_defined_poverty | bwelfr_idea | celec_idea | ... | sl_uem_basc_zs | sl_uem_intm_zs | sl_uem_neet_zs | sl_uem_totl_ne_zs | sl_uem_totl_zs | sorigeq_idea | children_out_of_school_primary_popad | number_of_people_in_poverty_popad | se_prm_enrl_popad | se_prm_uner_popad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 161.000000 | 163.000000 | 163.000000 | ... | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 157.000000 | 163.000000 | 163.000000 |
| mean | 0.387850 | 0.195372 | 0.206617 | 0.218393 | 0.252208 | 0.115085 | 0.434415 | 0.316232 | 0.501240 | 0.377800 | ... | 0.277274 | 0.186628 | 0.303558 | 0.190362 | 0.198524 | 0.467355 | 0.368577 | 0.376792 | 0.341979 | 0.368577 |
| std | 0.216589 | 0.228002 | 0.227229 | 0.225866 | 0.181878 | 0.124419 | 0.244160 | 0.201892 | 0.230104 | 0.264628 | ... | 0.184035 | 0.213292 | 0.176713 | 0.228566 | 0.219666 | 0.222174 | 0.227807 | 0.184247 | 0.214003 | 0.227807 |
| min | 0.000928 | 0.000626 | 0.000520 | 0.001590 | 0.004029 | 0.001351 | 0.027417 | 0.000640 | 0.016800 | 0.002547 | ... | 0.043444 | 0.000028 | 0.007297 | 0.000800 | 0.001878 | 0.009054 | 0.004969 | 0.002428 | 0.007680 | 0.004969 |
| 25% | 0.256093 | 0.059401 | 0.066029 | 0.083225 | 0.128230 | 0.040735 | 0.271225 | 0.143978 | 0.333668 | 0.158350 | ... | 0.137568 | 0.055075 | 0.159773 | 0.059098 | 0.068768 | 0.322170 | 0.216504 | 0.245130 | 0.158868 | 0.216504 |
| 50% | 0.359843 | 0.118145 | 0.120092 | 0.153682 | 0.229505 | 0.080014 | 0.417821 | 0.302500 | 0.534719 | 0.326675 | ... | 0.241544 | 0.105750 | 0.307121 | 0.109840 | 0.125344 | 0.475763 | 0.336537 | 0.360404 | 0.320319 | 0.336537 |
| 75% | 0.536367 | 0.208652 | 0.248242 | 0.234634 | 0.308487 | 0.138275 | 0.613179 | 0.482666 | 0.673043 | 0.556967 | ... | 0.328199 | 0.213421 | 0.442947 | 0.195234 | 0.220081 | 0.630822 | 0.456001 | 0.500019 | 0.501769 | 0.456001 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| kurt | -0.266649 | 4.130014 | 3.310943 | 3.349050 | 2.380585 | 15.734015 | -0.792941 | 0.191899 | -0.632788 | -0.805408 | ... | 2.458553 | 3.768640 | 0.293624 | 4.249041 | 4.523826 | -0.585461 | 0.323029 | -0.087898 | -0.701179 | 0.323029 |
| skew | 0.384654 | 2.144577 | 1.973368 | 1.962901 | 1.373183 | 3.076391 | 0.241072 | 0.557092 | -0.472460 | 0.563743 | ... | 1.549685 | 2.060983 | 0.427655 | 2.193893 | 2.191538 | -0.054858 | 0.876469 | 0.410248 | 0.381986 | 0.876469 |
10 rows × 163 columns
| count | mean | std | min | 25% | 50% | 75% | max | kurt | skew | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 1.630000e+02 | 163.000000 | 163.000000 |
| mean | 160.288344 | 0.312022 | 0.212453 | 0.004634 | 0.152663 | 0.281719 | 0.430319 | 1.000000e+00 | 1.979320 | 0.974915 |
| std | 6.932411 | 0.088756 | 0.040934 | 0.006052 | 0.071587 | 0.106789 | 0.130628 | 1.926915e-16 | 4.672045 | 0.879496 |
| min | 111.000000 | 0.103081 | 0.104570 | 0.000006 | 0.032839 | 0.070177 | 0.138275 | 1.000000e+00 | -1.216883 | -0.472460 |
| 25% | 161.000000 | 0.244078 | 0.185993 | 0.000912 | 0.094903 | 0.198161 | 0.338747 | 1.000000e+00 | -0.457994 | 0.387392 |
| 50% | 163.000000 | 0.332182 | 0.215861 | 0.002096 | 0.144637 | 0.287331 | 0.471964 | 1.000000e+00 | 0.159116 | 0.809801 |
| 75% | 163.000000 | 0.371008 | 0.243293 | 0.006620 | 0.202568 | 0.361178 | 0.527513 | 1.000000e+00 | 2.331137 | 1.361166 |
| max | 163.000000 | 0.501240 | 0.277576 | 0.043444 | 0.333668 | 0.534719 | 0.693074 | 1.000000e+00 | 31.338345 | 4.247861 |
| kurt | 26.256311 | -0.582568 | -0.375937 | 11.534364 | -0.428541 | -0.689956 | -0.724668 | 0.000000e+00 | 14.708252 | 1.142517 |
| skew | -4.635171 | -0.411866 | -0.380811 | 2.788073 | 0.499928 | -0.061930 | -0.530690 | 0.000000e+00 | 3.396693 | 1.114549 |
We can see that most of the lower two quartiles of the absolute values of r-scores are significantly low enough to consider their indicators uncorrelated (i.e. absr_super_desc[75%, 50%] = 0.36). That is, most indicators don't correlated well with most other indicators. And, only a portion of the upper quartiles pf r-scores are high enough to consider their indicators strongly correlated (see absr_super_desc[max, 75%] = 0.74).
Before we compare indicators across axises, let's have a quick peak at the heatmap of the absolute value of r-values.
abs_rvals.iplot(kind='heatmap',
title='Absolute Values of Correlations of Indicators')
The pockets of correlation and axes of noncorrelation are clearer now.
Let's group indicators.
# Get index of label groups to index "views" of axises.
dem_ind_names_arr = labels_sr[labels_sr == Ind_Axis.DEM].index.values
soceq_ind_names_arr = labels_sr[labels_sr == Ind_Axis.SOCEQ].index.values
edu_ind_names_arr = labels_sr[labels_sr == Ind_Axis.EDU].index.values
# dem_ind_names_arr
# soceq_ind_names_arr
# edu_ind_names_arr
# Get "views" of axises from correlations table.
print('Democracy Axis')
dem_df = abs_rvals[dem_ind_names_arr]
# dem_df_better_desc, dem_df_super_desc = get_better_desc(dem_df)
# dem_df_super_desc
get_better_desc(dem_df)[0]
get_better_desc(dem_df)[1]
print('SocioEq Axis')
soceq_df = abs_rvals[soceq_ind_names_arr]
# soceq_df_better_desc, soceq_df_super_desc = get_better_desc(soceq_df)
# soceq_df_super_desc
get_better_desc(soceq_df)[0]
get_better_desc(soceq_df)[1]
print('Education Axis')
edu_df = abs_rvals[edu_ind_names_arr]
# edu_df_better_desc, edu_df_super_desc = get_better_desc(edu_df)
# edu_df_super_desc
get_better_desc(edu_df)[0]
get_better_desc(edu_df)[1]
Democracy Axis
| indicator | abscorrup_idea | ajustice_idea | bwelfr_idea | celec_idea | checkgvt_idea | civlibx_eiu | civsocpart_idea | cliberties_idea | corruption_perception_index_cpi | democracy_score_use_as_color | ... | judicindep_idea | locdemo_idea | mediaint_idea | pisec_idea | polcultx_eiu | polpartix_eiu | prenforct_idea | rgov_idea | sgeq_idea | sorigeq_idea |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 157.000000 | 158.000000 | ... | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 |
| mean | 0.387850 | 0.434415 | 0.501240 | 0.377800 | 0.380902 | 0.367262 | 0.325480 | 0.374348 | 0.397836 | 0.334013 | ... | 0.361935 | 0.363017 | 0.325656 | 0.437199 | 0.323997 | 0.383161 | 0.407891 | 0.373438 | 0.400055 | 0.467355 |
| std | 0.216589 | 0.244160 | 0.230104 | 0.264628 | 0.269382 | 0.265579 | 0.268717 | 0.270536 | 0.238263 | 0.264714 | ... | 0.243402 | 0.252676 | 0.263667 | 0.233680 | 0.208795 | 0.217217 | 0.249653 | 0.271365 | 0.214898 | 0.222174 |
| min | 0.000928 | 0.027417 | 0.016800 | 0.002547 | 0.019504 | 0.002096 | 0.006295 | 0.016291 | 0.002306 | 0.001380 | ... | 0.005848 | 0.001158 | 0.000963 | 0.018223 | 0.001203 | 0.010282 | 0.015745 | 0.003657 | 0.009775 | 0.009054 |
| 25% | 0.256093 | 0.271225 | 0.333668 | 0.158350 | 0.166744 | 0.146132 | 0.110666 | 0.176817 | 0.219562 | 0.112173 | ... | 0.177342 | 0.144637 | 0.129031 | 0.286801 | 0.155122 | 0.214029 | 0.203149 | 0.151278 | 0.253708 | 0.322170 |
| 50% | 0.359843 | 0.417821 | 0.534719 | 0.326675 | 0.313929 | 0.318152 | 0.254079 | 0.308075 | 0.388523 | 0.270841 | ... | 0.292200 | 0.351243 | 0.222382 | 0.408873 | 0.293711 | 0.377013 | 0.361953 | 0.311285 | 0.383933 | 0.475763 |
| 75% | 0.536367 | 0.613179 | 0.673043 | 0.556967 | 0.562032 | 0.553586 | 0.488178 | 0.536970 | 0.577188 | 0.489455 | ... | 0.522997 | 0.502279 | 0.476768 | 0.603890 | 0.470218 | 0.539936 | 0.580241 | 0.554917 | 0.541228 | 0.630822 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| kurt | -0.266649 | -0.792941 | -0.632788 | -0.805408 | -0.642295 | -0.717871 | -0.478765 | -0.465351 | -0.712107 | -0.457036 | ... | -0.658801 | -0.813645 | -0.241984 | -0.706897 | -0.458951 | -0.693140 | -0.786762 | -0.736045 | -0.698586 | -0.585461 |
| skew | 0.384654 | 0.241072 | -0.472460 | 0.563743 | 0.701343 | 0.648831 | 0.860044 | 0.793675 | 0.278260 | 0.804556 | ... | 0.653671 | 0.482629 | 0.941977 | 0.269022 | 0.462910 | 0.305806 | 0.433698 | 0.631187 | 0.189090 | -0.054858 |
10 rows × 38 columns
| count | mean | std | min | 25% | 50% | 75% | max | kurt | skew | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 38.000000 | 38.000000 | 38.000000 | 38.000000 | 38.000000 | 38.000000 | 38.000000 | 38.0 | 38.000000 | 38.000000 |
| mean | 162.236842 | 0.358518 | 0.238675 | 0.007309 | 0.175019 | 0.317145 | 0.511566 | 1.0 | 0.032874 | 0.625580 |
| std | 2.329995 | 0.065575 | 0.033785 | 0.006930 | 0.068818 | 0.081184 | 0.091316 | 0.0 | 2.778367 | 0.436725 |
| min | 154.000000 | 0.205059 | 0.108660 | 0.000006 | 0.078003 | 0.165403 | 0.260870 | 1.0 | -0.919755 | -0.472460 |
| 25% | 163.000000 | 0.325524 | 0.222407 | 0.001240 | 0.112772 | 0.263898 | 0.478685 | 1.0 | -0.716430 | 0.385234 |
| 50% | 163.000000 | 0.373763 | 0.245262 | 0.005827 | 0.166412 | 0.312607 | 0.536668 | 1.0 | -0.603952 | 0.640009 |
| 75% | 163.000000 | 0.396099 | 0.265363 | 0.011376 | 0.218179 | 0.367425 | 0.561800 | 1.0 | -0.308914 | 0.828594 |
| max | 163.000000 | 0.501240 | 0.271564 | 0.027417 | 0.333668 | 0.534719 | 0.673043 | 1.0 | 16.352850 | 2.269299 |
| kurt | 7.841220 | 0.300250 | 4.780468 | 0.278195 | -0.457201 | 0.336602 | 0.646448 | 0.0 | 34.593107 | 4.948468 |
| skew | -2.982089 | -0.411377 | -1.846082 | 0.943842 | 0.608797 | 0.311415 | -0.925808 | 0.0 | 5.776424 | 1.021173 |
SocioEq Axis
| indicator | aged_15plus_unemployment_rate_percent | aged_15_24_unemployment_rate_percent | aged_25_54_unemployment_rate_percent | aged_55_64_unemployment_rate_percent | aged_65plus_unemployment_rate_percent | alternative_poverty_percent_below_nationally_defined_poverty | dollar_billionaires_per_million_people | gc_xpn_comp_zs | gc_xpn_totl_gd_zs | gini | ... | sl_emp_vuln_zs | sl_uem_1524_ne_zs | sl_uem_1524_zs | sl_uem_advn_zs | sl_uem_basc_zs | sl_uem_intm_zs | sl_uem_neet_zs | sl_uem_totl_ne_zs | sl_uem_totl_zs | number_of_people_in_poverty_popad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 161.000000 | 150.000000 | 163.000000 | 163.000000 | 163.000000 | ... | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 157.000000 |
| mean | 0.195372 | 0.206617 | 0.218393 | 0.252208 | 0.115085 | 0.316232 | 0.206813 | 0.331899 | 0.341799 | 0.320693 | ... | 0.424490 | 0.212287 | 0.224617 | 0.220810 | 0.277274 | 0.186628 | 0.303558 | 0.190362 | 0.198524 | 0.376792 |
| std | 0.228002 | 0.227229 | 0.225866 | 0.181878 | 0.124419 | 0.201892 | 0.130029 | 0.164133 | 0.169577 | 0.262715 | ... | 0.202961 | 0.231207 | 0.216013 | 0.164791 | 0.184035 | 0.213292 | 0.176713 | 0.228566 | 0.219666 | 0.184247 |
| min | 0.000626 | 0.000520 | 0.001590 | 0.004029 | 0.001351 | 0.000640 | 0.001796 | 0.002521 | 0.000097 | 0.000006 | ... | 0.012592 | 0.006359 | 0.001084 | 0.001369 | 0.043444 | 0.000028 | 0.007297 | 0.000800 | 0.001878 | 0.002428 |
| 25% | 0.059401 | 0.066029 | 0.083225 | 0.128230 | 0.040735 | 0.143978 | 0.134656 | 0.198146 | 0.220887 | 0.131583 | ... | 0.276640 | 0.063652 | 0.081407 | 0.087533 | 0.137568 | 0.055075 | 0.159773 | 0.059098 | 0.068768 | 0.245130 |
| 50% | 0.118145 | 0.120092 | 0.153682 | 0.229505 | 0.080014 | 0.302500 | 0.197668 | 0.366206 | 0.359548 | 0.267051 | ... | 0.415836 | 0.118475 | 0.152437 | 0.195214 | 0.241544 | 0.105750 | 0.307121 | 0.109840 | 0.125344 | 0.360404 |
| 75% | 0.208652 | 0.248242 | 0.234634 | 0.308487 | 0.138275 | 0.482666 | 0.263287 | 0.452417 | 0.461916 | 0.415025 | ... | 0.580125 | 0.252370 | 0.296813 | 0.317669 | 0.328199 | 0.213421 | 0.442947 | 0.195234 | 0.220081 | 0.500019 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| kurt | 4.130014 | 3.310943 | 3.349050 | 2.380585 | 15.734015 | 0.191899 | 8.507875 | 0.514136 | 0.375589 | 0.937436 | ... | -0.611322 | 2.875242 | 3.248694 | 2.478366 | 2.458553 | 3.768640 | 0.293624 | 4.249041 | 4.523826 | -0.087898 |
| skew | 2.144577 | 1.973368 | 1.962901 | 1.373183 | 3.076391 | 0.557092 | 1.884462 | 0.035223 | -0.061963 | 1.240230 | ... | -0.027471 | 1.858733 | 1.805913 | 1.202229 | 1.549685 | 2.060983 | 0.427655 | 2.193893 | 2.191538 | 0.410248 |
10 rows × 74 columns
| count | mean | std | min | 25% | 50% | 75% | max | kurt | skew | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 74.000000 | 74.000000 | 74.000000 | 74.000000 | 74.000000 | 74.000000 | 74.000000 | 7.400000e+01 | 74.000000 | 74.000000 |
| mean | 159.648649 | 0.282113 | 0.211983 | 0.003165 | 0.124772 | 0.246387 | 0.386763 | 1.000000e+00 | 2.539611 | 1.237931 |
| std | 5.698912 | 0.080772 | 0.040599 | 0.005649 | 0.057104 | 0.100945 | 0.128234 | 2.205185e-16 | 3.485573 | 0.854061 |
| min | 135.000000 | 0.115085 | 0.124419 | 0.000006 | 0.040038 | 0.080014 | 0.138275 | 1.000000e+00 | -1.216883 | -0.187507 |
| 25% | 161.000000 | 0.211355 | 0.182417 | 0.000685 | 0.083562 | 0.157435 | 0.263214 | 1.000000e+00 | 0.080992 | 0.557092 |
| 50% | 161.000000 | 0.276856 | 0.211765 | 0.001181 | 0.122690 | 0.240855 | 0.390125 | 1.000000e+00 | 0.964046 | 0.945012 |
| 75% | 163.000000 | 0.350291 | 0.237209 | 0.003358 | 0.152493 | 0.310664 | 0.496889 | 1.000000e+00 | 3.692507 | 1.944300 |
| max | 163.000000 | 0.468366 | 0.277576 | 0.043444 | 0.316980 | 0.514873 | 0.693074 | 1.000000e+00 | 15.734015 | 3.076391 |
| kurt | 6.225111 | -0.878806 | -0.811218 | 35.673451 | 1.211722 | -0.503881 | -1.045905 | 0.000000e+00 | 2.048308 | -0.798482 |
| skew | -2.454688 | -0.004652 | 0.004453 | 5.284968 | 0.954175 | 0.322402 | -0.087211 | 0.000000e+00 | 1.514686 | 0.434867 |
Education Axis
| indicator | literacy_rate_adult_total_percent_of_people_ages_15_and_above | literacy_rate_youth_total_percent_of_people_ages_15_24 | math_achievement_4th_grade | math_achievement_8th_grade | primary_completion_rate_total_percent_of_relevant_age_group | se_adt_1524_lt_zs | se_adt_litr_zs | se_com_durs | se_pre_durs | se_pre_enrl_tc_zs | ... | se_ter_enrl_tc_zs | se_ter_enrr | se_xpd_prim_pc_zs | se_xpd_seco_pc_zs | se_xpd_tert_pc_zs | se_xpd_totl_gb_zs | se_xpd_totl_gd_zs | children_out_of_school_primary_popad | se_prm_enrl_popad | se_prm_uner_popad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 153.000000 | 153.000000 | 111.000000 | 116.000000 | 163.000000 | 161.000000 | 161.000000 | 163.000000 | 163.000000 | 163.000000 | ... | 161.000000 | 163.000000 | 163.000000 | 163.000000 | 161.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 | 163.000000 |
| mean | 0.351622 | 0.324048 | 0.373550 | 0.393116 | 0.365935 | 0.318512 | 0.340287 | 0.200886 | 0.115349 | 0.316450 | ... | 0.123969 | 0.444570 | 0.352977 | 0.228024 | 0.291146 | 0.213918 | 0.265473 | 0.368577 | 0.341979 | 0.368577 |
| std | 0.249024 | 0.246990 | 0.213617 | 0.195716 | 0.233416 | 0.235629 | 0.243184 | 0.142063 | 0.104570 | 0.174655 | ... | 0.124497 | 0.195085 | 0.168029 | 0.172400 | 0.196316 | 0.154391 | 0.158531 | 0.227807 | 0.214003 | 0.227807 |
| min | 0.004441 | 0.004292 | 0.006881 | 0.003050 | 0.006206 | 0.003673 | 0.004355 | 0.001466 | 0.001093 | 0.007208 | ... | 0.002468 | 0.010008 | 0.006940 | 0.012510 | 0.002681 | 0.000795 | 0.000520 | 0.004969 | 0.007680 | 0.004969 |
| 25% | 0.159078 | 0.152203 | 0.178590 | 0.271094 | 0.204395 | 0.141218 | 0.168268 | 0.082392 | 0.058020 | 0.177081 | ... | 0.038054 | 0.316198 | 0.222811 | 0.103882 | 0.144753 | 0.107858 | 0.141828 | 0.216504 | 0.158868 | 0.216504 |
| 50% | 0.286044 | 0.258722 | 0.413078 | 0.424397 | 0.314043 | 0.255514 | 0.272202 | 0.198654 | 0.093390 | 0.309219 | ... | 0.081315 | 0.481448 | 0.391367 | 0.185800 | 0.253151 | 0.183077 | 0.259621 | 0.336537 | 0.320319 | 0.336537 |
| 75% | 0.523077 | 0.445717 | 0.540656 | 0.531733 | 0.455043 | 0.431918 | 0.484210 | 0.305787 | 0.141186 | 0.445114 | ... | 0.161218 | 0.579722 | 0.471964 | 0.296098 | 0.393026 | 0.314607 | 0.354241 | 0.456001 | 0.501769 | 0.456001 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| kurt | -0.266024 | 0.198997 | -0.367464 | 0.159116 | 0.034424 | 0.427056 | -0.044386 | 4.759635 | 31.338345 | 0.136529 | ... | 14.787486 | -0.391914 | 0.422806 | 2.163651 | 0.355587 | 5.699744 | 3.300066 | 0.323029 | -0.701179 | 0.323029 |
| skew | 0.727194 | 0.982363 | 0.014197 | -0.021008 | 0.809801 | 1.109369 | 0.858221 | 1.167725 | 4.247861 | 0.234525 | ... | 2.844381 | -0.401297 | 0.052837 | 1.349148 | 0.851095 | 1.753994 | 1.073131 | 0.876469 | 0.381986 | 0.876469 |
10 rows × 51 columns
| count | mean | std | min | 25% | 50% | 75% | max | kurt | skew | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 5.100000e+01 | 51.000000 | 51.000000 |
| mean | 159.764706 | 0.320774 | 0.193596 | 0.004773 | 0.176476 | 0.306589 | 0.432982 | 1.000000e+00 | 2.616644 | 0.853574 |
| std | 10.025145 | 0.098849 | 0.035946 | 0.005282 | 0.078925 | 0.117311 | 0.131438 | 2.220446e-16 | 6.596536 | 1.044005 |
| min | 111.000000 | 0.103081 | 0.104570 | 0.000028 | 0.032839 | 0.070177 | 0.141186 | 1.000000e+00 | -1.067322 | -0.401297 |
| 25% | 162.000000 | 0.278309 | 0.171791 | 0.001082 | 0.134370 | 0.254333 | 0.373633 | 1.000000e+00 | -0.343966 | 0.119762 |
| 50% | 163.000000 | 0.351805 | 0.196939 | 0.003673 | 0.178526 | 0.320319 | 0.456001 | 1.000000e+00 | 0.159116 | 0.727194 |
| 75% | 163.000000 | 0.391477 | 0.220009 | 0.006910 | 0.234641 | 0.403719 | 0.529557 | 1.000000e+00 | 1.418931 | 1.091250 |
| max | 163.000000 | 0.487429 | 0.249024 | 0.026428 | 0.316198 | 0.497423 | 0.631175 | 1.000000e+00 | 31.338345 | 4.247861 |
| kurt | 17.751519 | -0.266164 | -0.072979 | 5.649457 | -0.819197 | -0.682860 | -0.281268 | 0.000000e+00 | 9.905711 | 2.344630 |
| skew | -4.183505 | -0.775891 | -0.693096 | 2.140930 | -0.161702 | -0.503942 | -0.834390 | 0.000000e+00 | 3.103911 | 1.549217 |
The tables above are summary stats of the summary stats of each "view" of the table of absolute values of r-scores. The views are selected columns (based on the label given them) and all the rows. So, the views show each axis of indicators' correlations to the rest of the table.
Just as in the table as a whole, each set of indicators only have a few members that even have a portion of their upper quartile of correlation values that show a moderately significant correlation to another indicator (see [max, 75%] == 0.7+/-). Also like the whole set of indicators, most indicators in each set have roughly half of their r-scores low enough to say there's no correlation. Looking at the skew and kurtosis columns (and rows), we see more variation between sets, meaning some axises of indicators may be more correlated to other axises of indicators than others. But, we have to keep in mind that these tables include self-correlations, and set sizes differ.
Let's separately correlate the axises with each other, and with themselves. We'll take a look at the axis correlations as a whole, then look at especially high and low correlations.
I'll use surface maps rather than heat maps to plot the tables. It allows the relationships to be quickly grokked by those who are color oriented as well as those who are spacially oriented. The maps are interactive, so you can rotate them to view from directly above or below and simulate a 2D heatmap. Unfortunately, though iplot takes axis label parameters, it is not including them in the output as of this execution.
# Get "views" of axis tables with selected axis rows to plot axis correlations.
# Democracy by Democracy
dem_dem_df = dem_df.loc[dem_ind_names_arr]
dem_dem_df.columns.name = 'dem_axis'
dem_dem_df.index.name = 'dem_axis'
# Democracry by SocEq
dem_soceq_df = dem_df.loc[soceq_ind_names_arr]
dem_soceq_df.columns.name = 'dem_axis'
dem_soceq_df.index.name = 'soceq_axis'
# Democracy by Education
dem_edu_df = dem_df.loc[edu_ind_names_arr]
dem_edu_df.columns.name = 'dem_axis'
dem_edu_df.index.name = 'edu_axis'
# SocEq by SocEq
soceq_soceq_df = soceq_df.loc[soceq_ind_names_arr]
soceq_soceq_df.columns.name = 'soceq_axis'
soceq_soceq_df.index.name = 'soceq_axis'
# SocEq by Education
soceq_edu_df = soceq_df.loc[edu_ind_names_arr]
soceq_edu_df.columns.name = 'soceq_axis'
soceq_edu_df.index.name = 'edu_axis'
# Education by Education
edu_edu_df = edu_df.loc[edu_ind_names_arr]
edu_edu_df.columns.name = 'edu_axis'
edu_edu_df.index.name = 'edu_axis'
dem_dem_df.iplot(kind='surface',
title='Democracy Indicators Correlated with Themselves',
xTitle='Dem Inds', yTitle='Dem Inds', zTitle='abs(r)')
dem_soceq_df.iplot(kind='surface',
title='Democracy Indicators Correlated with SocEq Indicators',
xTitle='SocEq Inds', yTitle='Dem Inds', zTitle='abs(r)')
dem_edu_df.iplot(kind='surface',
title='Democracy Indicators Correlated with Education Indicators',
xTitle='Edu Inds', yTitle='Dem Inds', zTitle='abs(r)')
soceq_soceq_df.iplot(kind='surface',
title='SocEq Indicators Correlated with Themselves',
xTitle='SocEq Inds', yTitle='SocEq Inds', zTitle='abs(r)')
soceq_edu_df.iplot(kind='surface',
title='SocEq Indicators Correlated with Education Indicators',
xTitle='Edu Inds', yTitle='SocEq Inds', zTitle='abs(r)')
edu_edu_df.iplot(kind='surface',
title='Education Indicators Correlated with Themselves',
xTitle='Edu Inds', yTitle='Edu Inds', zTitle='abs(r)')
Before we pull out some exemplar and exceptional indicator pairs and wrap up, there are some things to note about the surface maps above.
The democracy axis looks highly self-correlated overall. The construct seems fairly coherent and cohesive, even accross sources. It's worth noting that a few of these indicators are actually composites of some of the other indicators (e.g. demox_eiu). There are a couple of exceptional indicators that form a tic-tac-toe grid of low correlation.
The socioeconomic equity axis is largely not self-correlated, except along the axis of direct self-correlation and correlation among similar indicators from the same source (i.e. similarly prefixed names, thus alphabetically proximal). This low degree of self-correlation makes it hard to view the axis as a stable construct. There are a some patches of high correlation among definitively related indicators (e.g. between indicators of income distribution, or between indicators of unemployment and unemployment insurance). I would say that my loose and casual method of selecting these indicators and defining the axis is the major driver for such poor construction. This axis probably needs to be deconstructed as a set of axises in and of itself, which I'm not going to do.
The education axis is more self-correlated, but not as much as the democracy axis. It's criss-crossed with a number of low-correlating indicators. If I wanted to (de/re)construct this axis, I might start by pulling a list of these low-correlating indicators and looking more closely.
The democracy axis and socioeconomic equity axis are mostly not correlated, but there are some striking peaks and axises of high correlation, for instance between an indicator of government efficacy and an indicator of poverty. In fact measures of poverty and mean consumption seem to be fairly correlated to measures of democracy as a whole.
The democracy axis and education axis have some notable peaks and indicators. For instance: some measures of educational attainment and enrollment are moderately correlated with the democracy axis, no doubt partially due to gender equity being included as a measure of democracy, leading to more students as a whole; some measures of educational expenditure do not correlated well to the democracy axis. The basic welfare indicator from IDEA is fairly correlated to education indicators overall, which is not surprising since it includes a few measures of education.
The education and socioeconomic axises are similarly related, For instance, educational attainment and enrollment are correlated to overall income per capita and poverty indicators.
To wrap up, let's pull out some notable indicators and take a closer look. I'll grab the indicators with the highest and lowest correlations across axises. Then, I'll take a look at a few indicators that caught my interest in the surface maps above.
It looks like gender equity (gendereq_idea), government functioning (gvtx_eiu), and poverty (si_pov_mdim_xq) are all correlated.
First let's get that list of indicators.
# Pull the highest/lowest correlated indicators from each table that correlates
# across and axis.
def get_rows_and_cols_from_val(df: pd.DataFrame, val_thresh: float,
above: bool = True) -> set:
'''Return a set of row and column names that contain at least one value
above or below a given value.
Parameters:
df: pandas.DataFrame to search.
val_thresh: float value to compare field values to.
above: (optional) boolean to choose whether field values should be >=
or <= val_thresh. Default True selects >=.
Returns:
rows_cols_of_interest_set: set of row and column names containing at
at least one value that fits the criterion.'''
rows_cols_of_interest_set = set()
# Not necessary to declare here for proper scope in Python,
# but it bugs me not to.
mask = pd.DataFrame()
# Get dataframe with only the columns and rows with the values of interest.
if above:
mask = df[df >= val_thresh]
else:
mask = df[df <= val_thresh]
masked_df = df.loc[mask.any(axis=1), mask.any(axis=0)]
# Get a list of the rows and columns of interest.
rows_cols_of_interest_set.update(set(masked_df.index.values),
set(masked_df.columns.values))
return rows_cols_of_interest_set
# Find indicators that correlated strongly across axises.
r_thresh = 0.85
high_crossaxis_r_inds_set = set()
high_crossaxis_r_inds_set.update(get_rows_and_cols_from_val(df=dem_edu_df,
val_thresh=r_thresh),
get_rows_and_cols_from_val(df=dem_soceq_df,
val_thresh=r_thresh),
get_rows_and_cols_from_val(df=soceq_edu_df,
val_thresh=r_thresh))
print('Indicators that correlated highly across axises.')
high_crossaxis_r_inds_set
Indicators that correlated highly across axises.
{'bwelfr_idea',
'gendereq_idea',
'gvtx_eiu',
'se_sec_enrr',
'se_sec_nenr',
'se_ter_cuat_do_zs',
'se_xpd_totl_gb_zs',
'se_xpd_totl_gd_zs',
'si_pov_mdim_xq',
'si_spr_pc40',
'si_spr_pcap'}
# Find indicators that correlated weakly across axises.
r_thresh = 0.15
low_crossaxis_r_inds_set = set()
low_crossaxis_r_inds_set.update(get_rows_and_cols_from_val(df=dem_edu_df,
val_thresh=r_thresh,
above=False),
get_rows_and_cols_from_val(df=dem_soceq_df,
val_thresh=r_thresh,
above=False),
get_rows_and_cols_from_val(df=soceq_edu_df,
val_thresh=r_thresh,
above=False))
print('Indicators that do not correlate much across axises.')
len(low_crossaxis_r_inds_set)
Indicators that do not correlate much across axises.
163
Again, we have a lot more extreme absence of correlation across axises than we have high correlation. Every indicator is uncorrelated with at least one other indicator from a different axis, but only a handful are highly correlated with another correlated from a different axis
Let's look at our trio of indicators: gender equity (gendereq_idea), government functioning (gvtx_eiu), and poverty (si_pov_mdim_xq).
There is some leakage between the poverty indicator and the gender equality indicator; access to schooling is included in both indicators, and when half the population has limited or no access based on their gender, both of these indicators will be affected. But, educational access is only a single point in each of these indicators' constellation of sub-indicators. I'm not sure the educational access link is enough to explain a correlation as strong as we see below. There may be other underlying factors, and/or there may be a logical link between a empowering all of society rather than just half of it and the economic productivity and power of its members. I am not going to draw any statistical conclusions, though.
There is also some leakage between the gender equity indicator and the government functioning indicator. One of the 14 points of the government functioning indicator has to do with special groups, like religious groups that may be patriarchal, having "significant political power, parallel to democractic institutions." Again, I am not sure that point of overlap in indicators is enough to explain their correlation. In fact, three of the five points of the gender equality indicator are to do with inclusion of women in government and civil society organizations. In other words, the more women we see in government, the better it tends to run, and vice-versa. Again, I can't draw a statistical conclusion about causation.
Interestingly, I see no direct leakage between definitions of the government functioning indicator and the poverty indicator. A major underlying factor may be the resources and global economic context of the country in question. It takes national wealth to run a country well.
# pandas_profiling can be compute intensive and crash your browser.
# Uncomment and run this cell instead if needed to make scatter plots.
print('r:', cors_dfs_dict['Pears'].loc['gendereq_idea',
'si_pov_mdim_xq'])
px.scatter(data_frame=stacked_indicators_df,
x=stacked_indicators_df['gendereq_idea'],
y=stacked_indicators_df['si_pov_mdim_xq'],
labels={'x':'gendereq_idea', 'y':'si_pov_mdim_xq'})
print('r:', cors_dfs_dict['Pears'].loc['gendereq_idea',
'gvtx_eiu'])
px.scatter(data_frame=stacked_indicators_df,
x=stacked_indicators_df['gendereq_idea'],
y=stacked_indicators_df['gvtx_eiu'],
labels={'x':'gendereq_idea', 'y':'gvtx_eiu'})
print('r:', cors_dfs_dict['Pears'].loc['si_pov_mdim_xq',
'gvtx_eiu'])
px.scatter(data_frame=stacked_indicators_df,
x=stacked_indicators_df['si_pov_mdim_xq'],
y=stacked_indicators_df['gvtx_eiu'],
labels={'x':'si_pov_mdim_xq', 'y':'gvtx_eiu'})
r: -0.8581766183494676
r: 0.6782802268577326
r: -0.8690448989326301
inds_of_interest_df = stacked_indicators_df[['gendereq_idea', 'gvtx_eiu',
'si_pov_mdim_xq']]
inds_of_interest_df.dropna(axis=0, how='all', inplace=True)
inds_of_interest_profile = ProfileReport(inds_of_interest_df,
title='Gender Equity, Government Efficacy, and Poverty')
# Note you will likely need to click out of this cell and re-run it to produce
# the widgets.
inds_of_interest_profile.to_widgets()
I grabbed baskets of indicators relating to democracy, education, and socioeconomic equity. I found that mostly there was a lack of correlation between these three axises, contrary to my expectation. There some notable correlations, however, including between gender equity, government efficacy, and poverty.
There were profound problems with my casual data selection and grouping method (if you could call it a method). Not the least of these problems was the fact that many of the indicators overlapped by definition with indicators in a separate axis (i.e. democracy, education, equity). Democracy seems to have become a far-reaching construct incorporating so many aspects of a healthy society as to reveal the paradigmatic bias of the society that constructed it. Stable though it is in its self-correlation, the definition of democracy might need to be narrowed to methods and structures of political power and government function in order to draw more meaningful correlations between it and other constructs like education and socioeconomic equity that currently overlap with democracy, causing leakage in the correlations. My casual choice of indicators of socioeconomic equity was apparent in the way the axis was largely not self-correlated.
I made heavy use of pandas data structures and operations. Over the course of this project I 'leveled up' my Python coding, especially with regards to functional programming.
I also test drove cufflinks and pandas_profiling. I found them both to be buggy. When pandas_profiling didn't crash my browser, it often omitted many values and graphics.
I made heavy use of pandas, numpy, and Python documentation, as well as Stack Overflow, Geeks for Geeks, Real Python, and other instructional blogs, too many to count and too inconsequential to include. I did not copy any code nor make use of any unique or proprietary ideas. I only gleaned how to code better and solve my problems. To cite them would be like citing books about the English language that I have read simply because I am writing in English and using words I first saw in them.